Memory grows linearly with context length, while compute grows quadratically (but with a small constant; up to ~100k tokens, inference is still dominated by the non-quadratic factors).
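As a rough illustration of that scaling, here's a back-of-the-envelope sketch (the model dimensions are hypothetical, loosely 7B-class with full multi-head KV) comparing KV-cache memory, which grows linearly with sequence length, against attention FLOPs, which grow quadratically:

```python
# Hypothetical 7B-class dimensions; adjust for the model you actually care about.
n_layers   = 32      # transformer layers
n_kv_heads = 32      # KV heads (no grouped-query sharing assumed here)
head_dim   = 128     # per-head dimension
bytes_per  = 2       # fp16/bf16

def kv_cache_bytes(seq_len: int) -> int:
    # 2 tensors (K and V) per layer, each [seq_len, n_kv_heads, head_dim]
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per

def attn_flops(seq_len: int) -> int:
    # QK^T plus attention-weighted V: ~4 * seq_len^2 * hidden per layer
    hidden = n_kv_heads * head_dim
    return 4 * n_layers * seq_len * seq_len * hidden

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens: KV cache ~ {kv_cache_bytes(n) / 2**30:6.1f} GiB, "
          f"attention ~ {attn_flops(n) / 1e12:8.1f} TFLOPs")
```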
Also, reusing keys/values across different queries can compress the KV cache; that can be a 1000x or 10000x improvement in bandwidth if the model is trained for it.
Just to clarify: simple prefix KV cache doesn't require any special model training. It does require the inference framework to support it, but most do by now.
You can see dramatic improvements in latency and throughput if there is a large prefix shared across queries.
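To make the prefix-reuse point concrete, here's a minimal sketch roughly following the cache-reuse pattern supported by recent Hugging Face transformers versions (the model id is just a placeholder): the shared prefix is prefilled once, and each subsequent query only pays prefill cost for its own short suffix.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM with KV-cache support works the same way.
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Long shared prefix, e.g. a system prompt plus few-shot examples.
prefix = "You are a helpful assistant. <long instructions / examples here> "
prefix_ids = tok(prefix, return_tensors="pt").input_ids

# Prefill the shared prefix once and keep its KV cache around.
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

# Each query reuses the cached prefix; only the suffix tokens get prefilled.
# (Assumes the prefix tokenizes to the same ids as a prefix of the full prompt.)
for question in ["What is a KV cache?", "Why does prefix caching cut latency?"]:
    full_ids = tok(prefix + question, return_tensors="pt").input_ids
    cache = copy.deepcopy(prefix_cache)  # keep the shared cache clean across queries
    out = model.generate(full_ids, past_key_values=cache, max_new_tokens=50)
    print(tok.decode(out[0, full_ids.shape[1]:], skip_special_tokens=True))
```

Serving frameworks typically do this prefix matching for you automatically, which is where the latency/throughput wins above come from.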