Introduction
I have an R720 with 40 cores and an Nvidia GPU with 24GB of memory and a lot of CUDA cores, but up until now I was only really able to develop software that utilizes a single core. And even though I do not quite understand this benchmark in detail, here is a single-threaded comparison of the two machines.

Desktop:
sysbench --threads=1 cpu run
General statistics:
    total time: 10.0004s
    total number of events: 19577
Latency (ms):
    min: 0.50
    avg: 0.51
    max: 1.46
    95th percentile: 0.52
    sum: 9954.17
Threads fairness:
    events (avg/stddev): 19577.0000/0.00
    execution time (avg/stddev): 9.9542/0.00
Server:
General statistics:
    total time: 10.0017s
    total number of events: 7307
Latency (ms):
    min: 0.99
    avg: 1.37
    max: 4.07
    95th percentile: 1.58
    sum: 9985.85
Threads fairness:
    events (avg/stddev): 7307.0000/0.00
    execution time (avg/stddev): 9.9859/0.00
It looks to me like they perform similarly, maybe the Desktop is even a bit better. The big advantages the Desktop has over the Server are that I am already working on it, it does not take minutes to boot, and it is way quieter.
So up until now there was little incentive to run my own software on this machine.
The difference now is that this is the first program I have ever written that is capable of using all CPU cores. Run on my Desktop (which I think has only about 16 cores overall) it freezes the machine, but on the server I can create an LXC with 30 cores, and the program uses all 30 without problems and runs way faster.
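As a small aside, a quick way to check how many logical cores a machine (or a container) actually exposes to a program is std::thread::available_parallelism:

```rust
use std::thread;

fn main() {
    // Number of logical cores the OS reports to this process; inside an LXC
    // this reflects what the container was given, not the whole host.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("logical cores available: {cores}");
}
```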
Multithreading in Rust
I had to experiment a bit and understand how the std library works, but overall I would say it was rather easy to get the hang of. I packed the code I wanted to run in parallel into one function without mutable arguments, everything else as references, and gave it a non-reference return type. Then I just passed that function, with all the references wrapped in Arcs, into the FnOnce closure of thread::spawn, and off it went. To gather the output I set up a communication primitive, an mpsc channel: whenever the function returns, the thread sends its data. And at the end I wrote a small loop that catches the data and adds it up into the final sum.
for i in 0..num_chunks {
    let mut file = Arc::clone(&file);
    let tx = tx.clone();
    let index_map = Arc::clone(&index_map);
    let acceptable_types = Arc::clone(&acceptable_types);
    thread::spawn(move || {
        // Read this thread's chunk of the file...
        let mut chunk = vec![0; CHUNKSIZE.min(file_size as usize - i * CHUNKSIZE)];
        file.seek(SeekFrom::Start((i * CHUNKSIZE) as u64)).unwrap();
        let _amount = file.read(&mut chunk).unwrap();
        let chunk = chunk.iter().map(|c| *c as char).collect();
        // ...process it and send the result back over the channel.
        tx.send(line_process(
            &chunk,
            &acceptable_types,
            offset_back,
            offset_front,
            &index_map,
        ))
        .unwrap();
    });
}
// Drop the original sender so the receiver loop ends once all threads finish.
drop(tx);
for received in rx {
    final_sum += &received;
}
The Memory leak
The program, as is, does run for a while, depending on how much memory and how many CPU cores you give the LXC container, but as I was watching the Proxmox interface I noticed that it crashed as soon as there was no free memory left. This is a classic Linux kernel mechanism, the OOM killer: to prevent the whole system from failing, it just stops the biggest memory consumer that is not essential for the operating system to run.
And as I was watching it again I also observed that the memory usage was steadily climbing, even though, if you think about what the program should do from a memory perspective, it should not. It creates all these threads, but I assumed the operating system would schedule them to only start working when the previous ones finished. So as soon as all the threads were running, the memory consumption should not rise, especially since I am adding the return values up and do not store them separately.
This definitely sounds like a memory leak. {{memroy_leak.svg}}
Observations
My initial understanding of the Rust mindset, looking at the differences between Rust and C++, was that everything that could be unwanted should never happen in Rust. But as I searched the web, I learned that while memory leaks are not common, they are not actually prevented by the Rust language, so if you are writing multicore programs, it is quite possible to create one. So I set out on a quest to find the leak and fix it. To help me, I installed a tool named heaptrack, which records a program and provides useful data about the memory consumption of your application. This is the output.
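The textbook example of a leak in safe Rust is a reference-count cycle: two Rc values pointing at each other can never drop their counts to zero, so their memory is never freed. A minimal sketch:

```rust
use std::cell::RefCell;
use std::rc::Rc;

// A node that can own a reference to another node.
struct Node {
    next: RefCell<Option<Rc<Node>>>,
}

fn main() {
    let a = Rc::new(Node { next: RefCell::new(None) });
    let b = Rc::new(Node { next: RefCell::new(Some(Rc::clone(&a))) });
    // Close the cycle: a -> b -> a. Both strong counts are now 2...
    *a.next.borrow_mut() = Some(Rc::clone(&b));
    // ...so when `a` and `b` go out of scope the counts only drop to 1,
    // the destructors never run, and the allocations are leaked.
    println!("strong count of a: {}", Rc::strong_count(&a)); // 2
}
```

(This is not what happened in my program; it is just to show that safe Rust permits leaks.)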
total runtime: 12min42s
calls to allocation functions: 15,821,110 (20,736/s)
temporary memory allocations: 180,141 (1.14%, 236/s)
peak heap memory consumption: 79.5GB after 12min42s
peak RSS (including heaptrack overhead): 12.1GB
total memory leaked: 79.5GB
It is important to note that the program crashed, and heaptrack counts all memory that is still allocated when the program stops as leaked. What I found most fascinating about this output was the 79GB of leaked memory, even though the LXC the program was running in had only 22GB of memory. But looking at this, it was clear to me that the program was definitely leaking memory. I spent practically the rest of the day reading up on what could cause memory leaks in Rust, and how the scenarios found on the internet could apply to my application. ChatGPT also was not much help, as it mostly told me to use a thread pool, something I, because I was so sure I was facing a memory leak, dismissed as LLM hallucination.
Memory never leaked (mostly)
After hours of research, I started to think that I might not have a memory leak after all. Since I was moving data into the for loop that handles the receiving, maybe the data sent did not get dropped when the next message came in. So I set up a print statement to tell me how much memory the receiver is using.
for received in &rx {
    // Note: size_of_val only reports the size of the Receiver handle itself,
    // not the messages queued inside the channel.
    println!("Memory: {:?}", std::mem::size_of_val(&rx));
    final_sum += &received;
}
But it was never called. Bit odd, isn't it? So I started to set up print statements at different points of the program, and these were my observations:

- The loop runs without problems and wants to start many threads.
- It takes a while until a thread spawned by the loop actually gets started, and then a while until it is finished.
- The receiver never runs.
- The program crashes before all the threads that could have been started get initialized.
Most of this lined up with my mental model of the code, since it starts many threads and lets the kernel decide when each one really gets executed. But I thought the receiver would start working as soon as the first thread finished and sent its data. Then it hit me:
The Cause
The thread-spawning loop and the receiver were never running in parallel. The receiver would only start processing messages once all possible threads were set up, and until then the data/messages had to be stored somewhere: in the channel's buffer. The memory also never leaked; heaptrack just counts everything that is not deallocated by the program itself before it stops as leaked, and since the program crashed, it never deallocated the memory used by the messages.
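In other words, the default mpsc channel is unbounded, so it happily buffers every message until someone drains it. A bounded channel, mpsc::sync_channel, makes this pressure visible: send blocks once the buffer is full, so results cannot pile up. A minimal sketch with a dummy workload:

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // A bounded channel: send() blocks once 8 results are waiting,
    // so finished threads wait instead of piling results up in memory.
    let (tx, rx) = mpsc::sync_channel::<u64>(8);

    for i in 0..32u64 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(i * i).unwrap();
        });
    }
    drop(tx);

    // The receiver drains the channel, unblocking waiting senders as it goes.
    let total: u64 = rx.iter().sum();
    println!("total: {total}"); // 0² + 1² + ... + 31² = 10416
}
```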
After wondering whether I should try a thread pool from rayon, I decided I first wanted to try a quick and dirty solution: just throw the thread spawner into its own thread.
let thread_spawner = thread::spawn(move || {
    for i in 0..num_chunks {
        let mut file = Arc::clone(&file);
        let tx = tx.clone();
        print!("At {i} out of {num_chunks}\r");
        let index_map = Arc::clone(&index_map);
        let acceptable_types = Arc::clone(&acceptable_types);
        thread::spawn(move || {
            let mut chunk = vec![0; CHUNKSIZE.min(file_size as usize - i * CHUNKSIZE)];
            file.seek(SeekFrom::Start((i * CHUNKSIZE) as u64)).unwrap();
            let _amount = file.read(&mut chunk).unwrap();
            let chunk = chunk.iter().map(|c| *c as char).collect();
            tx.send(line_process(
                &chunk,
                &acceptable_types,
                offset_back,
                offset_front,
                &index_map,
            ))
            .unwrap();
        });
    }
    // tx was moved into this closure, so it is dropped here,
    // which ends the receiver loop below once all workers finish.
});
// The receiver now drains the channel while the spawner is still running.
for received in rx {
    final_sum += &received;
}
thread_spawner.join().unwrap();
And, amazingly, it worked.
Run 1:
total runtime: 2852.26s (~47min)
calls to allocation functions: 55,836,442 (19,576/s)
temporary memory allocations: 638,496 (223/s)
peak heap memory consumption: 9.60M
peak RSS (including heaptrack overhead): 64.33M
total memory leaked: 460.43K

Run 2:
total runtime: 2660.16s (~44min)
calls to allocation functions: 55,813,836 (20,981/s)
temporary memory allocations: 849,101 (319/s)
peak heap memory consumption: 4.46M
peak RSS (including heaptrack overhead): 69.92M
total memory leaked: 460.43K
Important to note: it did not crash, and while there are still some bytes leaked, it is nowhere near 79GB (I still have no idea where heaptrack got that number from). Interestingly, a thread that has been spawned but has not started running yet costs virtually no heap memory, maybe because I mostly use references, so the closures practically don't need any space, but it is nice to know.
Going forward
Multithreading is a lot of fun, and it makes this giant cube sitting next to me really useful, so I will definitely use it more often. I will also have to learn a bit about rayon, since it looks like a really handy tool for this whole multithreading business, but I think this project was a perfect starter, and it is super cool watching top rate your program at 700% CPU.