Feels like all of this local LLM stuff is definitely pushing people toward buying new hardware, since older cards like the RX 570/580 don't see support.
On one hand, the hardware nowadays is better and more powerful, but on the other, the initial version of CUDA came out in 2007 and ROCm in 2016. You'd think that compute on GPUs wouldn't require the latest cards.
Ollama's backend, llama.cpp, definitely supports those older cards via its OpenCL and Vulkan backends, though performance is worse than with ROCm or CUDA. In the Vulkan thread, for instance, I see people getting it working on Polaris and even Hawaii cards.
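If anyone wants to poke at this from Python rather than the C++ side, here's a rough sketch using the llama-cpp-python bindings. It assumes the package was built with the Vulkan backend enabled; the model path and layer count are placeholders to tune for an 8 GB card:

    # Rough sketch: run a GGUF model with partial GPU offload through
    # llama.cpp's Vulkan backend via the llama-cpp-python bindings.
    # Assumes the wheel was compiled with Vulkan support; the model path
    # and layer count are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=20,  # partial offload; tune for an 8 GB RX 570/580
        n_ctx=2048,
    )

    out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])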
No new hardware needed. I was shocked that Mixtral runs well on my laptop, which has a so-so mobile GPU. Mixtral isn't hugely fast, but definitely good enough!
Thanks, wow, it's amazing that you can already run a small model with so little RAM. I need to buy a new laptop; I guess more than 16 GB on a MacBook isn't really needed.
I've run LLMs and some of the various image models on my 32 GB M1 Studio without issue. It's not as fast as my old 3080 card, but considering the Mac, all in, has about a fifth of the power draw, it's a lot closer than I expected. I'm not sure of the exact details, but there is clearly some secret sauce that lets it leverage the onboard NN hardware.
Super easy. You can just head down to https://lmstudio.ai and pick up an app that lets you play around. It's not particularly advanced, but it works pretty well.
It's mostly optimized for M-series silicon, but it also technically works on Windows, and isn't too difficult to trick into working on Linux either.
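If you'd rather script against it than use the chat window, LM Studio can also expose a local OpenAI-compatible server from within the app, and after that it's just a plain HTTP call. A minimal sketch, assuming the default port of 1234 and whatever model you have loaded:

    import requests

    # Minimal sketch: query LM Studio's local OpenAI-compatible server.
    # Assumes the server has been started in the app; 1234 is the default
    # port and the model field is whichever model is currently loaded.
    resp = requests.post(
        "http://localhost:1234/v1/chat/completions",
        json={
            "model": "local-model",  # placeholder; the loaded model is used
            "messages": [{"role": "user", "content": "Say hi in five words."}],
            "temperature": 0.7,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])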
Looks super cool, though it seems to be missing a good chunk of features, like the ability to change the prompt format. (I just installed it myself to check out all the options.) All the other missing stuff I can see, though, is stuff that LM Studio doesn't have either (such as a notebook mode). If it has a good chat mode, that's good enough for most!
llama.cpp added first-class support for the RX 580 when it implemented the Vulkan backend. There are some issues with the older amdgpu kernel code (5.x kernels) where an LLM process's VRAM is never moved back in if it gets kicked out to GTT, but overall it's much faster than the CLBlast OpenCL implementation.
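If you want to see whether the weights are actually sitting in VRAM or have been spilled to GTT, amdgpu exposes the counters in sysfs. A quick watcher sketch (assumes the amdgpu driver and that the GPU shows up as card0; adjust the index for your machine):

    import time

    # Quick-and-dirty watcher for amdgpu memory counters, to see whether an
    # LLM's weights are resident in VRAM or have been evicted to GTT
    # (system RAM). Assumes the amdgpu driver and that the GPU is card0.
    DEV = "/sys/class/drm/card0/device"

    def read_mib(name):
        with open(f"{DEV}/{name}") as f:
            return int(f.read()) / (1024 * 1024)

    while True:
        print(
            f"VRAM {read_mib('mem_info_vram_used'):.0f}"
            f"/{read_mib('mem_info_vram_total'):.0f} MiB   "
            f"GTT {read_mib('mem_info_gtt_used'):.0f}"
            f"/{read_mib('mem_info_gtt_total'):.0f} MiB"
        )
        time.sleep(1)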
The compatibility matrix is quite complex for both AMD and NVIDIA graphics cards, and I completely agree: there is a lot of work to do, but the hope is to gracefully fall back to older cards... they still speed up inference quite a bit when they do work!
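To make "gracefully fall back" concrete, here's a purely hypothetical sketch of that kind of selection logic; it is not Ollama's actual code, and the cutoffs and gfx list are made up for illustration:

    from dataclasses import dataclass

    # Purely illustrative backend selection: prefer the fastest backend the
    # detected card can actually run, and fall back instead of failing.
    # NOT Ollama's real code; the cutoffs and gfx list below are made up.
    SUPPORTED_ROCM_GFX = {"gfx900", "gfx906", "gfx1030", "gfx1100"}  # hypothetical subset

    @dataclass
    class Gpu:
        vendor: str
        compute_capability: tuple = (0, 0)  # NVIDIA only
        gfx_version: str = ""               # AMD only

    def pick_backend(gpu):
        if gpu is None:
            return "cpu"
        if gpu.vendor == "nvidia" and gpu.compute_capability >= (5, 0):
            return "cuda"
        if gpu.vendor == "amd" and gpu.gfx_version in SUPPORTED_ROCM_GFX:
            return "rocm"
        # Polaris/Hawaii and other unsupported parts: slower, but still a win over CPU.
        return "vulkan"

    print(pick_backend(Gpu(vendor="amd", gfx_version="gfx803")))  # -> vulkan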