While I do not have an A100 handy right now, I do have an instance running on Genesis Cloud with 4x RTX 3090.
A quick, very unscientific test using oobabooga/text-generation-webui with some models I tried earlier gives me:
* oasst-sft-7-llama-30b (spread over 4x GPU): Output generated in 28.26 seconds (5.77 tokens/s, 163 tokens, context 55, seed 1589698825)
* llama-30b-4bit-128g (only using 1 GPU as it is so small): Output generated in 12.88 seconds (6.29 tokens/s, 81 tokens, context 308, seed 1374806153)
* llama-65b-4bit-128g (only using 2 GPU): Output generated in 33.36 seconds (3.81 tokens/s, 127 tokens, context 94, seed 512503086)
* llama (vanilla, using 4x GPU): Output generated in 5.75 seconds (4.69 tokens/s, 27 tokens, context 160, seed 1561420693)
They all feel fast enough for interactive use. If you do not have an interface that streams the output (so you can watch it progress), it might feel a bit awkward to often wait ~30 s for the whole output chunk to arrive.
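If you want to reproduce a similar tokens/s measurement outside the webui, something along these lines should work. This is just a rough sketch using the Hugging Face transformers API with `device_map="auto"` to spread the weights across all visible GPUs; the model name, per-GPU memory cap, and prompt are placeholders, and it is not what text-generation-webui does internally.

```python
# Rough multi-GPU generation speed check (assumes transformers + accelerate installed).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-30b"  # placeholder; any LLaMA-style checkpoint works

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # shard layers across all visible GPUs (needs accelerate)
    # Cap per-GPU memory a bit below 24 GB so activations still fit on a 3090.
    max_memory={i: "22GiB" for i in range(torch.cuda.device_count())},
)

prompt = "Explain the difference between a llama and an alpaca."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

start = time.time()
output = model.generate(**inputs, max_new_tokens=128, do_sample=True)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.2f} tokens/s)")
```

The numbers you get this way will not match the webui exactly (different samplers, context lengths, and quantization), but they should land in the same ballpark.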