
Yes, not sure you can do better than that. You still cannot have a single instance of an LLM loaded in (GPU) memory answer two queries at the same time.


Of course, concurrent requests can be supported in general. But Ollama doesn't support them; it isn't meant for that purpose, and that's perfectly fine. That's not the point, though. For fast/high-performance serving scenarios, you're better off with vLLM.
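If you want to check this yourself, here's a rough sketch (the port, endpoint path, and model name below are just placeholder assumptions for a local vLLM OpenAI-compatible server, not anything from this thread): fire N requests concurrently and compare the wall-clock time against N times the single-request latency. With a server that batches concurrent requests, total time grows much more slowly than linearly; with a server that serializes them, it won't.

```python
# Sketch: send several completions concurrently to an assumed local
# OpenAI-compatible endpoint and time them. URL/port/model are placeholders.
import asyncio
import time

import httpx

URL = "http://localhost:8000/v1/completions"   # assumed vLLM default port
PAYLOAD = {
    "model": "my-model",      # placeholder model name
    "prompt": "Say hello.",
    "max_tokens": 64,
}

async def one_request(client: httpx.AsyncClient) -> float:
    # Time a single completion request.
    start = time.perf_counter()
    resp = await client.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

async def main(n: int = 8) -> None:
    async with httpx.AsyncClient() as client:
        start = time.perf_counter()
        latencies = await asyncio.gather(*(one_request(client) for _ in range(n)))
        total = time.perf_counter() - start
    print(f"{n} concurrent requests took {total:.1f}s total; "
          f"per-request latencies: {[round(l, 1) for l in latencies]}")

if __name__ == "__main__":
    asyncio.run(main())
```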


Thanks! This is great to know.



