I've experimented with it; the reason I haven't added it yet is that I want deployment to be seamless, and it's not trivial to ship a binary that (without extra fuss or configuration) efficiently supports Metal and CUDA and downloads the models gracefully. This is of course possible, but it's still hard, and it's not clear it's the right place to spend energy. I'm curious how you think about it - is your primary desire to work offline, or to avoid sending data to OpenAI? Or both?
The latter mostly. It's also free, uncensored, and can never disappear from under me.
FWIW, from my understanding, llama.cpp is pretty easy to integrate and is reasonably fast for being API-agnostic. Ollama embeds it, for example. No pressure, just pointing it out :)
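For what it's worth, llama.cpp also ships a `llama-server` binary that exposes an OpenAI-compatible HTTP API, so an existing OpenAI integration can often be pointed at a local model with little more than a base-URL change rather than linking the library directly. A minimal sketch in Python, assuming the project already uses an OpenAI-style chat-completions call; the server command, port, model path, and prompt below are placeholders, not anything from this project:

```python
# Sketch only: reusing an OpenAI-style client against a local llama.cpp server.
# Assumes llama-server is already running with a downloaded GGUF model, e.g.:
#   llama-server -m ./models/some-model.gguf --port 8080
# (model path and port are placeholders)

from openai import OpenAI

# llama-server speaks an OpenAI-compatible API; the API key is unused locally,
# but the client requires some value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    # Placeholder model name; llama-server serves whatever model it was started with.
    model="local",
    messages=[{"role": "user", "content": "Explain what `ls -la` does."}],
)
print(resp.choices[0].message.content)
```

The appeal of this route is that the hard parts the earlier comment mentions (Metal vs. CUDA builds, model downloads) stay on the llama.cpp/server side rather than in the shipped binary, at the cost of asking the user to run a local server themselves.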