
Yep, and it deserves the credit! He who writes the CUDA kernel (or translates it) controls the spice.

I had wrapped this and had it working in Ollama months ago as well: https://github.com/ollama/ollama/pull/814. I don't use Ollama anymore, but I really like the way they handle device memory allocation dynamically; I think they were the first to do this well.



I'm curious about both:

- what's special about the memory allocation, and how might it help me?

- what are you now using instead of ollama?


Ollama does a nice job of looking at how much VRAM the card has and tuning the number of GPU layers it offloads. Before that, I mainly just had to guess. It's still a heuristic, but I thought that was neat.
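
Roughly, the heuristic is: query free VRAM, subtract some headroom, and divide by an estimated per-layer footprint. A minimal sketch of that idea (not Ollama's actual code; the per-layer size and reserve numbers below are made-up assumptions you'd really derive from the model's tensor sizes):

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstdio>

    // Estimate how many transformer layers fit in free VRAM.
    // bytes_per_layer and reserve_bytes are hypothetical inputs; in practice
    // they'd be computed from the model metadata (quantization, hidden size, etc.).
    int estimate_gpu_layers(size_t bytes_per_layer, size_t reserve_bytes, int total_layers) {
        size_t free_bytes = 0, total_bytes = 0;
        if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
            return 0;  // no usable CUDA device: keep everything on the CPU
        }
        if (free_bytes <= reserve_bytes) {
            return 0;
        }
        size_t usable = free_bytes - reserve_bytes;  // headroom for KV cache, scratch buffers
        int layers = static_cast<int>(usable / bytes_per_layer);
        return std::min(layers, total_layers);
    }

    int main() {
        // Example guess: ~200 MiB per layer, 1 GiB reserved, 32 layers total.
        int n_gpu_layers = estimate_gpu_layers(200ull << 20, 1ull << 30, 32);
        std::printf("offloading %d layers\n", n_gpu_layers);
    }
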

I'm mostly just using llama.cpp as a native library now, mainly for direct access to more of llama.cpp's data structures, and because I have a somewhat unique sampler setup.
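
For context, a custom sampler setup here just means doing your own token selection over the raw logits instead of using the built-in samplers. A rough sketch of that shape, assuming you already have the float* logits for the last position (e.g. from llama_get_logits) and the vocab size; the temperature/top-k steps are placeholders for whatever custom logic you actually want:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>
    #include <vector>

    // Custom sampler over raw logits: temperature + top-k, then draw a token.
    // "logits" would come from llama_get_logits(ctx); n_vocab from the model.
    int32_t sample_token(const float * logits, int n_vocab,
                         float temp, int top_k, std::mt19937 & rng) {
        std::vector<std::pair<float, int32_t>> cand(n_vocab);
        for (int i = 0; i < n_vocab; ++i) {
            cand[i] = { logits[i] / temp, i };   // temperature scaling
        }

        // keep only the top_k highest-logit tokens
        top_k = std::min(top_k, n_vocab);
        std::partial_sort(cand.begin(), cand.begin() + top_k, cand.end(),
                          [](const auto & a, const auto & b) { return a.first > b.first; });
        cand.resize(top_k);

        // softmax over the survivors
        float max_logit = cand[0].first;
        std::vector<float> probs(top_k);
        float sum = 0.0f;
        for (int i = 0; i < top_k; ++i) {
            probs[i] = std::exp(cand[i].first - max_logit);
            sum += probs[i];
        }
        for (float & p : probs) p /= sum;

        // sample a token id from the resulting distribution
        std::discrete_distribution<int> dist(probs.begin(), probs.end());
        return cand[dist(rng)].second;
    }
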


Oh right... I've just been guessing, trying to find the largest value that doesn't trigger CUDA OOM errors.



