
Trying to figure out a bizarre performance drop in a merge sort I wrote for one of those daily coding problems.

https://bruceediger.com/posts/mergesort-investigation-1/

It seems that adding one node to a linked list causes a repeatable performance drop. That is, linked lists of say 2^21 nodes sort faster than lists of 2^21 + 1 nodes.



The fact that it happens

    1) at/around powers of two
    2) regardless of whether you're using GC or preallocating
makes me think it's got to do with cache sizes. For example, the M2 MacBook has around 16 MiB of cache, which gives you roughly 1 million ~ 2^20 nodes, where each node is a 64-bit number and a 64-bit pointer.
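
Back-of-envelope, assuming the 16-byte node layout described further down in the thread (a 64-bit value plus a 64-bit pointer):

    package main

    import "fmt"

    func main() {
        const nodeSize = 8 + 8            // 64-bit data + 64-bit pointer
        const cacheSize = 16 << 20        // ~16 MiB of cache
        fmt.Println(cacheSize / nodeSize) // 1048576, i.e. 2^20 nodes
    }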

The jumps in your measurements could be related to the cache hierarchy. On Linux you should be able to run "perf stat ./mergesort" to see your CPU's cache hit rates (perf isn't available on macOS, but Instruments can show similar counters).
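For the cache angle specifically, something like "perf stat -e cache-references,cache-misses ./mergesort" (exact event names vary by CPU) would break the hit/miss counts out directly.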

I'd be interested in a follow-up post :)


Thanks for reading my blog post and thinking about it. I thought of cache as well, but the 3 machines I'm running benchmarks on are all very different. The M2 MacBook has 128-byte cache lines, as opposed to the other 2 machines, which have 64-byte cache lines. The old Dell R530 has level 1 through 3 caches; the others have only levels 1 and 2. I'm still trying to understand CPU caching and see if I can correlate it to the performance drops somehow.

The linked list nodes are 16 bytes, and both 8-byte fields (.Next and .Data) get read and written during sorting. Up to a point, larger node sizes, which I would think would change caching behavior, don't change where the performance drops occur: https://bruceediger.com/posts/mergesort-investigation-7/
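
For a rough idea of what varying the node size looks like, something along these lines (the padding here is illustrative; the exact sizes tried in the post may differ):

    package main

    import (
        "fmt"
        "unsafe"
    )

    // The 16-byte node as described above.
    type node struct {
        data uint64
        next *node
    }

    // Illustrative only: padding bumps the node to 64 bytes without
    // touching the algorithm; the exact sizes tried in the post may differ.
    type paddedNode struct {
        data uint64
        next *paddedNode
        _    [48]byte
    }

    func main() {
        fmt.Println(unsafe.Sizeof(node{}), unsafe.Sizeof(paddedNode{})) // 16 64
    }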

Sorting an already-sorted or a reverse-sorted list doesn't exhibit the performance drops either: https://bruceediger.com/posts/mergesort-investigation-9/

A purely recursive mergesort doesn't show the performance drops: https://bruceediger.com/posts/mergesort-investigation-3/
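
For anyone curious, a purely recursive linked-list mergesort has roughly this shape in Go (a sketch, not the exact code from the posts):

    package main

    import "fmt"

    type node struct {
        data uint64
        next *node
    }

    // split walks the list with slow/fast pointers and cuts it in half.
    func split(head *node) (*node, *node) {
        slow, fast := head, head.next
        for fast != nil && fast.next != nil {
            slow = slow.next
            fast = fast.next.next
        }
        second := slow.next
        slow.next = nil
        return head, second
    }

    // merge splices two sorted lists together.
    func merge(a, b *node) *node {
        if a == nil {
            return b
        }
        if b == nil {
            return a
        }
        if a.data <= b.data {
            a.next = merge(a.next, b)
            return a
        }
        b.next = merge(a, b.next)
        return b
    }

    // mergesort recurses on both halves, then merges them.
    func mergesort(head *node) *node {
        if head == nil || head.next == nil {
            return head
        }
        a, b := split(head)
        return merge(mergesort(a), mergesort(b))
    }

    func main() {
        // Build a short reversed list 4 -> 3 -> 2 -> 1 and sort it.
        var head *node
        for i := uint64(1); i <= 4; i++ {
            head = &node{data: i, next: head}
        }
        for n := mergesort(head); n != nil; n = n.next {
            fmt.Print(n.data, " ")
        }
        fmt.Println()
    }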


Is the compiler, or a library you're using, somehow adding code that takes a different strategy at runtime when the linked list gets bigger than the performance-fall-off size?


Good question, but I don't think so. Two mergesort variants show it, one does not. A C transliteration of one of the variants that shows it also shows the performance drops. The Go compiler for macOS on M2 silicon probably generates very different code than it does for x86_64 on Linux, yet both show the same performance drops.

The code only uses standard Go packages. I wrote the linked list struct myself; I'm not using a package's struct or code. I don't even use standard library code during sorting, only in timing and list preparation.

What you suggest may be true, but it seems very unlikely.


No worries. It does sound like something else is going on then. :)


Does it have to do with when garbage collection is triggered?


I don't think so. There's no garbage generated during the sort, which is what I'm timing. I'm sure that GC runs during the sorts, but since nothing is being allocated it shouldn't have significant work to do.

A C transliteration of my original Go algorithm shows similar overall performance, with performance drops at the same list lengths.

Forcing garbage collection right after each sort (releasing all references to the sorted linked list) didn't change anything, either: https://bruceediger.com/posts/mergesort-investigation-10/
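
For completeness, the two standard-library knobs for ruling GC in or out are runtime.GC() (force a collection) and debug.SetGCPercent(-1) (disable the collector). A minimal sketch of the timing setup, not necessarily how the posts do it:

    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
        "time"
    )

    type node struct {
        data uint64
        next *node
    }

    // buildList allocates all n nodes up front, before the timed region.
    func buildList(n int) *node {
        var head *node
        for i := 0; i < n; i++ {
            head = &node{data: uint64(i), next: head}
        }
        return head
    }

    func main() {
        debug.SetGCPercent(-1) // turn the collector off for the experiment
        list := buildList(1 << 21)
        runtime.GC() // clean up anything left over from list construction

        start := time.Now()
        _ = list // the sort under test would run here
        fmt.Println("elapsed:", time.Since(start))
    }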



