
Feels like all of this local LLM stuff is pushing people toward buying new hardware, since older cards like the RX 570/580 see little to no support.

On one hand, the hardware nowadays is better and more powerful, but on the other, the initial version of CUDA came out in 2007 and ROCm in 2016. You'd think that compute on GPUs wouldn't require the latest cards.



llama.cpp, the backend Ollama uses, definitely supports those older cards via its OpenCL and Vulkan backends, though performance is worse than with ROCm or CUDA. In the Vulkan PR thread, for instance, I see people getting it working on Polaris and even Hawaii cards.

https://github.com/ggerganov/llama.cpp/pull/2059

Personally I just run it on CPU and several tokens/s is good enough for my purposes.
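
For anyone who wants to try it, this is the rough shape of it (build flags as of the Vulkan-PR era of llama.cpp; newer releases have renamed them, so check the current README, and the model path here is just a placeholder):

    # Vulkan backend -- covers Polaris cards like the RX 570/580
    make LLAMA_VULKAN=1

    # or the older clBLAST/OpenCL backend
    make LLAMA_CLBLAST=1

    # offload some layers to the GPU; -ngl 0 keeps it CPU-only
    # (replace the .gguf path with whatever model you actually have)
    ./main -m ./models/mistral-7b-instruct.Q4_K_M.gguf -ngl 20 -p "Hello"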


No new hardware needed. I was shocked that Mixtral runs well on my laptop, which has a so-so mobile GPU. Mixtral isn't hugely fast, but definitely good enough!


I'm a happy user of Mistral on my Mac Air M1.


How many GB of RAM do you have in your M1 machine?


8 GB


Thanks, wow, it's amazing that you can already run a small model with so little RAM. I need to buy a new laptop; I guess more than 16 GB on a MacBook isn't really needed.


I would advise getting as much RAM as you can afford; you can't upgrade it later.

Mine is 64GB, and my memory pressure goes into the red when running a quantized 70B model with a dozen Chrome tabs open.
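
Rough back-of-envelope for why it gets tight (exact numbers depend on the quant and context size):

    70B params x ~0.5 bytes (4-bit quant)  ~ 35-40 GB of weights
    + KV cache for a few thousand tokens   ~ a few GB
    + macOS, Chrome, everything else       ~ 10+ GB

so a 70B model plus a normal desktop workload really does push up against 64 GB.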


I use several LLMs locally for chat UIs and Copilot-style IDE autocompletion (continue.dev).

Between Teams, Chrome, VS Code, Outlook, and now LLMs, my RAM usage sits around 20-22 GB. 16 GB would be a real bottleneck.
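
If anyone wants to replicate that kind of setup, the Ollama side of it is roughly this (the model tags are just examples; Continue then gets pointed at Ollama's default local endpoint, http://localhost:11434, per their docs):

    ollama pull mistral        # chat model
    ollama pull codellama:7b   # code/autocomplete model
    ollama serve               # only needed if it isn't already running as a service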


I've run LLMs and some of the various image models on my M1 Studio 32GB without issue. Not as fast as my old 3080 card, but considering the Mac, all in, draws about a fifth of the power, it's a lot closer than I expected. I'm not sure of the exact details, but there is clearly some secret sauce that lets it leverage the onboard NN hardware.


Mistral is _very_ small when quantized.

I’d still go with 16 GB, though.
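
Rough numbers, for a sense of scale (varies by quant):

    7B params x ~0.5 bytes (4-bit quant)  ~ 4 GB of weights
    + KV cache and runtime overhead       ~ another GB or so

It fits in 8 GB, but it leaves little headroom for everything else, which is why 16 GB is the comfortable minimum.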


Is it easy to set this up?


Super easy. You can just head down to https://lmstudio.ai and pick up an app that lets you play around. It's not particularly advanced, but it works pretty well.

It's mostly optimized for M-series silicon, but it also technically works on Windows, and isn't too difficult to trick into working on Linux either.


Also, https://jan.ai is open source and worth trying out too.


Looks super cool, though it seems to be missing a good chunk of features, like the ability to change the prompt format (I just installed it myself to check out the options). Everything else I see missing is stuff LM Studio doesn't have either, such as a notebook mode. If it has a good chat mode, that's good enough for most people.


It is; it barely requires any setup.

After installation:

> ollama run mistral:latest
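
(On Linux the install itself is one line too; this was the official script URL at the time, but it's worth checking ollama.com before piping anything to a shell:)

    curl -fsSL https://ollama.com/install.sh | sh

On macOS it's a regular app download from the same site.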


llama.cpp added first-class support for the RX 580 by implementing the Vulkan backend. There are some issues in older amdgpu kernel code (5.x kernels) where an LLM process's VRAM is never reloaded once it gets kicked out to GTT, but overall it's much faster than the clBLAST OpenCL implementation.
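
If you want to see when that happens, radeontop shows VRAM vs. GTT usage, or you can read the amdgpu counters directly (the card index may differ on your system):

    # bytes of VRAM and GTT currently in use (amdgpu sysfs)
    cat /sys/class/drm/card0/device/mem_info_vram_used
    cat /sys/class/drm/card0/device/mem_info_gtt_used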


The compatibility matrix is quite complex for both AMD and NVIDIA graphics cards, and I completely agree: there is a lot of work to do, but the hope is to gracefully fall back to older cards... they still speed up inference quite a bit when they do work!


My 1080 Ti runs even this 47B model OK: https://ollama.com/library/dolphin-mixtral
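
In case anyone wonders how a 47B model runs on an 11 GB card at all, the rough math (assuming a ~4-bit quant):

    ~47B params x ~0.5 bytes  ~ 25-30 GB for the whole model
    11 GB VRAM                ~ maybe a third of the layers offloaded
    the rest                  ~ system RAM, run on the CPU

Ollama/llama.cpp only offloads as many layers as fit and runs the remainder on the CPU, and Mixtral being a mixture-of-experts model only activates roughly 13B of its ~47B parameters per token, which is why it stays usable.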



