To the extent that you're memory-bandwidth limited, you should be able to run multiple inferences at once: latency stays high, but getting multiple samplings can be extremely useful and compensates somewhat for that latency.
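A rough roofline sketch of why this works: in batched decoding, the weights are streamed from memory once per step and reused across every sequence in the batch, so tokens/sec scales with batch size until you hit the compute ceiling. All the hardware and model numbers below are illustrative assumptions, not measurements, and KV-cache traffic is ignored for simplicity.

```python
# Back-of-envelope roofline for batched decoding.
# Assumed figures: 7B fp16 model, 100 GB/s memory bandwidth, 20 TFLOP/s compute.

PARAMS = 7e9          # assumed 7B-parameter model
BYTES_PER_PARAM = 2   # fp16 weights
MEM_BW = 100e9        # assumed memory bandwidth, bytes/s
COMPUTE = 20e12       # assumed usable compute, FLOP/s

def tokens_per_sec(batch: int) -> float:
    weight_bytes = PARAMS * BYTES_PER_PARAM
    step_time_mem = weight_bytes / MEM_BW             # one weight pass per decode step
    step_time_flops = 2 * PARAMS * batch / COMPUTE    # ~2 FLOPs per param per token
    # The step takes as long as the slower of the two limits.
    return batch / max(step_time_mem, step_time_flops)

for b in (1, 4, 16, 64):
    print(f"batch {b:3d}: ~{tokens_per_sec(b):6.0f} tok/s")
```

Under these assumed numbers throughput grows nearly linearly with batch size, because each extra sequence rides along on the same weight reads.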
To an extent, but memory bandwidth soon becomes a bottleneck there too. The hidden state and the KV cache are large, so it becomes a matter of how fast you can move data in and out of your L2 cache. If you don't have a unified memory pool, it gets even worse.
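To put some numbers on that, here's a minimal KV-cache sizing sketch: each layer stores two tensors (K and V) of n_kv_heads * head_dim values per token. The config below is an assumption modeled on a typical 7B architecture without grouped-query attention.

```python
# Assumed 7B-style config: 32 layers, 32 KV heads, head_dim 128, fp16.
N_LAYERS = 32
N_KV_HEADS = 32
HEAD_DIM = 128
BYTES = 2  # fp16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # K and V each hold n_kv_heads * head_dim values per token per layer.
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES
    return per_token * seq_len * batch

print(kv_cache_bytes(4096) / 2**20, "MiB per 4k-token sequence")  # ~2048 MiB
```

Under these assumptions that's about 512 KiB of cache per token, roughly 2 GiB for one 4k-context sequence, and it multiplies by the batch size, which is exactly why batching pushes you back into the bandwidth problem.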