
Yet it does get done for really big models. I can't find the write-up of how Cerebras is handling Llama 405B for reference, but they are splitting that one by layers (which is why it's not available right now).

https://cerebras.ai/blog/introducing-cerebras-inference-ai-a...



That's not quite the same thing. There are three layers in the memory hierarchy (there are more, but for the purposes of this discussion, three is sufficient):

- CPU RAM

- GPU RAM

- GPU SRAM

The grandposter was talking about moving layers between CPU RAM and GPU RAM, while Cerebras stores models entirely in SRAM.

CPU RAM <-> GPU RAM is _much_ slower than GPU RAM <-> GPU SRAM: the former will never be competitive with keeping the weights entirely on the accelerator. Comparing the bandwidth numbers given in that article with the bandwidth of PCIe should demonstrate the problem.
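
To make the gap concrete, here's a rough back-of-envelope in Python. All the figures are illustrative round numbers (not vendor specs), and the model size is just an example; the point is that a bandwidth-bound decoder has to read every weight once per generated token:

    model_size_gb = 140  # e.g. a 70B-parameter model at fp16 (illustrative)

    # Rough, illustrative bandwidth figures in GB/s -- not vendor specs.
    links = {
        "PCIe 5.0 x16 (CPU RAM -> GPU RAM)": 64,
        "HBM (GPU RAM -> compute)": 3_000,
        "on-wafer SRAM (Cerebras-style)": 20_000_000,
    }

    # If every weight must be streamed once per token,
    # tokens/s <= bandwidth / model size.
    for name, gbps in links.items():
        print(f"{name}: ~{gbps / model_size_gb:.1f} tokens/s ceiling")

Even with generous rounding, the PCIe path caps out at well under one token per second for a model that size, which is why offloading over the CPU is a last resort.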

Larger models are split between multiple accelerators, but even then they avoid going through the CPU as much as possible and instead use a direct interconnect (NVLink, etc). The weights are uploaded ahead of time to ensure they don't need to be reloaded during active inference.
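
For what it's worth, here's a toy single-GPU PyTorch sketch of the two strategies. The plain nn.Sequential and the function names are my assumptions, not any particular serving stack; real multi-accelerator setups shard over NVLink rather than offloading to CPU RAM:

    import torch
    import torch.nn as nn

    def preload_then_run(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
        # Upload every weight once, ahead of time; the forward pass
        # never touches CPU RAM for weights again.
        model.to("cuda")
        with torch.no_grad():
            return model(x.to("cuda"))

    def offloaded_run(model: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
        # Layer-offloading style: every layer crosses PCIe on every
        # forward pass, so the bus bandwidth caps throughput.
        x = x.to("cuda")
        with torch.no_grad():
            for layer in model:
                layer.to("cuda")   # weights in over PCIe
                x = layer(x)
                layer.to("cpu")    # ...and straight back out
        return x

Both functions compute the same thing; the only difference is where the weights live during the forward pass, and that alone decides the throughput ceiling.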

The considerations for training are different, but the general principle of avoiding CPU traffic still holds.



