
Although there are multiple bottlenecks, my understanding (and why, at a certain point, throwing more threads at it doesn't help) is that inference for dense LLMs is largely limited by memory bandwidth. Most desktop computers have dual-channel DDR4/DDR5 memory, which will be hard-pressed to exceed 60 GB/s. A last-gen Epyc/Threadripper Pro has 8-channel DDR4-3200 support, which gives a theoretical max of 204.8 GB/s (benchmarking ends up more around 150 GB/s in AIDA64).
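As a sanity check on those numbers, theoretical peak bandwidth is just channels × transfer rate × 8 bytes (each channel has a 64-bit bus). A quick sketch:

```python
# Theoretical peak memory bandwidth: channels * MT/s * 8 bytes per transfer
# (64-bit bus per channel). Real-world benchmarks land well below this.
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int) -> float:
    return channels * mega_transfers_per_s * 8 / 1000  # GB/s

print(peak_bandwidth_gbs(2, 3200))    # 51.2  -- typical dual-channel desktop DDR4
print(peak_bandwidth_gbs(8, 3200))    # 204.8 -- 8-channel Epyc/Threadripper Pro
print(peak_bandwidth_gbs(12, 4800))   # 460.8 -- 12-channel Genoa
```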

The latest Genoa has 12-channel DDR5-4800 support (and boosted AVX-512), and I'd imagine it should perform quite well, but if you primarily want to run inference on a quantized 65B model, I think your best bang/buck (for local hardware) would be 2 x RTX 3090s (each has 24GB of GDDR6X with just shy of 1 TB/s of memory bandwidth).
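You can turn those bandwidth figures into a rough decode-speed ceiling: every generated token has to stream all the weights through memory once, so tokens/s is bounded by bandwidth divided by model size. A sketch, assuming ~0.5 bytes/param for a 4-bit quant and ignoring KV cache and activation traffic (all numbers here are ballpark, not measurements):

```python
# Rough decode-speed ceiling: each generated token streams all weights once,
# so tokens/s <= bandwidth / model size in bytes. Assumes ~0.5 bytes/param
# for a 4-bit quant; ignores KV cache and activations (illustrative only).
def max_tokens_per_s(bandwidth_gbs: float, params_b: float, bytes_per_param: float) -> float:
    model_gb = params_b * bytes_per_param
    return bandwidth_gbs / model_gb

# 65B at 4-bit is ~32.5 GB of weights:
print(max_tokens_per_s(150, 65, 0.5))       # ~4.6 tok/s on 8-channel DDR4 (~150 GB/s measured)
print(max_tokens_per_s(2 * 936, 65, 0.5))   # ~57.6 tok/s on 2x 3090, model split across both
```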



Yeah, it really is that bad on desktops.

With my LLaMA AVX implementation on 32-bit floats [0], there's no performance gain after 2 threads, so the remaining 14 available threads are of no use; there's no memory bandwidth left to load them with work :)

[0] https://github.com/gotzmann/llama.go
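A toy roofline model captures why the extra threads sit idle: aggregate throughput is min(threads × per-core streaming rate, total bus bandwidth), so once the bus saturates, more threads add nothing. The per-core and bus rates below are made-up illustrative numbers, not measurements of the linked project:

```python
# Toy roofline: throughput is capped by the memory bus, so threads beyond the
# saturation point gain nothing. The 25 GB/s per-core and 50 GB/s bus figures
# are illustrative assumptions, not measurements.
def effective_gbs(threads: int, per_core_gbs: float = 25.0, bus_gbs: float = 50.0) -> float:
    return min(threads * per_core_gbs, bus_gbs)

for t in (1, 2, 4, 16):
    print(t, effective_gbs(t))  # saturates at 50 GB/s from 2 threads onward
```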


To the extent that you're memory-bandwidth limited, you should be able to do multiple inferences at once: latency stays high, but getting multiple samplings can be extremely useful for many use cases and can partly make up for the high latency.
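The intuition: the weights are streamed from memory once per decode step but can be reused for every sequence in the batch, so aggregate tokens/s scales with batch size until compute becomes the limit. A sketch with illustrative rates (the compute ceiling here is an assumption):

```python
# Why batching helps when bandwidth-bound: one pass over the weights serves the
# whole batch, so aggregate tokens/s grows with batch size until the FLOP limit
# kicks in. All rates below are illustrative assumptions.
def batched_tokens_per_s(batch: int, bandwidth_gbs: float, model_gb: float,
                         compute_limit_tok_s: float) -> float:
    steps_per_s = bandwidth_gbs / model_gb  # weight-streaming limit per decode step
    return min(batch * steps_per_s, compute_limit_tok_s)

print(batched_tokens_per_s(1, 150, 32.5, 500))  # ~4.6 tok/s
print(batched_tokens_per_s(8, 150, 32.5, 500))  # ~36.9 tok/s -- same latency, ~8x throughput
```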


To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don’t have a unified memory pool it gets even worse.
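To put a number on the KV cache: it needs 2 tensors (K and V) per layer, each layer × hidden_dim elements per token. A sketch assuming LLaMA-65B's published shape (80 layers, 8192 hidden dim) and an fp16 cache:

```python
# KV-cache size per sequence: 2 (K and V) * layers * hidden_dim * bytes/elem
# * context length. Shapes assume LLaMA-65B (80 layers, 8192 hidden) in fp16.
def kv_cache_bytes(layers: int, hidden: int, ctx: int, dtype_bytes: int = 2) -> int:
    return 2 * layers * hidden * dtype_bytes * ctx

print(kv_cache_bytes(80, 8192, 1) / 1e6)     # ~2.6 MB per token
print(kv_cache_bytes(80, 8192, 2048) / 1e9)  # ~5.4 GB at a full 2048-token context
```

That's per batched sequence, which is why batching into a single memory pool fills up fast.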


Thank you, that makes sense. I had no idea that there was such a dramatic difference in memory bandwidth between desktop and server CPUs.


The two-channel DDR5 in desktops can't even do two channels very well -- if you try to put 64GB of RAM in (two dual-rank 32GB DIMMs), you lose around 50% of the bandwidth compared to single-rank DIMMs (e.g. dropping from 8000 MT/s to 4000 MT/s, with increased latency on top).
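Halving the transfer rate halves the theoretical peak, so the hit is large. A quick sketch using the rates cited above (actual achieved rates vary by board and memory controller):

```python
# Halving the DDR5 transfer rate halves peak bandwidth:
# 2 channels * MT/s * 8 bytes per transfer.
def dual_channel_gbs(mts: int) -> float:
    return 2 * mts * 8 / 1000  # GB/s

print(dual_channel_gbs(8000))  # 128.0 GB/s with single-rank DIMMs at rated speed
print(dual_channel_gbs(4000))  # 64.0 GB/s after the dual-rank downclock
```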



