It's pretty funny to see this blog post, when I have been running Ollama on my AMD RX 6650 for weeks :D
They have shipped ROCm containers since 0.1.27 (21 days ago). This blog post seems to be published along with the latest release, 0.1.29. I wonder what they actually changed in this release with regards to AMD support.
Also: see this issue[0] that I made where I worked through running Ollama on an AMD card that they don't "officially" support yet. It's just a matter of setting an environment variable.
I'm thrilled to see support for RX 6800/6800 XT / 6900 XT. I bought one of those for an outrageous amount during the post-covid shortage in hopes that I could use it for ML stuff, and thus far it hasn't been very successful, which is a shame because it's a beast of a card!
Sad to see that the cutoff is just after the 6700 XT, which is what's in my desktop. They indicate more devices are coming; hopefully that includes some of the more modern all-in-one chips with RDNA 2/3 from AMD as well.
I've been using my 6750XT for more than a year now on all sorts of AI projects. Takes a little research and a few env vars but no need to wait for "official" support most of the time.
The magic words seem to be `HSA_OVERRIDE_GFX_VERSION=10.3.0`.
You can invoke it directly via `HSA_OVERRIDE_GFX_VERSION=10.3.0 ollama serve`, but I added this line to the systemd unit at /etc/systemd/system/ollama.service:
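A minimal sketch of that kind of override, under the `[Service]` section of the unit (the exact line may differ in your setup):

```
# sketch: export the override for the ollama daemon
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
```

Then `sudo systemctl daemon-reload && sudo systemctl restart ollama` picks it up.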
I'm not sure why Ollama garners so much attention. It has limited value: it's only good for experimenting with models, and it can't serve more than one model at a time. It's not meant for production deployments. Granted, it makes the experimentation process super easy, but for something that relies completely on llama.cpp and whose main value proposition is easy model management, I'm not sure it deserves the brouhaha people are giving it.
Edit: what do you do after the initial experimentation? You eventually need to deploy these models to production. I'm not even talking about giving credit to llama.cpp, just that this product is gaining disproportionate attention and kudos compared to the value it delivers. Not denying that it's a great product.
I am probably not the demographics you expect. I don’t do “production” in that sense, but I have ollama running quite often when I am working, as I use it for RAG and as a fancy knowledge extraction engine. It is incredibly useful:
- I can test a lot of models by just pulling them (very useful as progress is very fast),
- using their command line is trivial,
- the fact that it keeps running in the background means that it starts once every few days and stays out of the way,
- it integrates nicely with langchain (and a host of other libraries), which means that it is easy to set up some sophisticated process and abstract away the LLM itself.
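To illustrate that last point, a minimal sketch using the `langchain-community` Ollama wrapper (the model name is just an example):

```python
# Sketch: swap the model string for whatever you've pulled with `ollama pull`.
from langchain_community.llms import Ollama

llm = Ollama(model="mistral")  # talks to the local ollama server on :11434
print(llm.invoke("Summarise the difference between RAG and fine-tuning in two sentences."))
```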
> what do you do after the initial experimentation?
I just keep using it. And for now, I keep tweaking my scripts but I expect them to stabilise at some point, because I use these models to do some real work, and this work is not monkeying about with LLMs.
> I'm not even talking about giving credit to llama.cpp, just mentioning that this product is gaining disproportionate attention and kudos compared to the value it delivers.
For me, there is nothing that comes close in terms of integration and convenience. The value it delivers is great, because it enables me to do some useful work without wasting time worrying about lower-level architecture details. Again, I am probably not in the demographics you have in mind (I am not a CS person and my programming is usually limited to HPC), but ollama is very useful to me. Its reputation is completely deserved, as far as I am concerned.
The use case is exploratory literature review in a specific scientific field.
I have a setup that takes pdfs and does some OCR and layout detection with Amazon, and then bunch them with some internal reports. Then, I have a pipeline to write summaries of each document and another one to slice them into chunks, get embeddings and set up a vector store for a RAG chat bot. At the moment it’s using Mixtral and the command line. But I like being able to swap LLMs to experiments with different models and quantisation without hassle, and I more or less plan to set this up on a remote server to free some resources on my workstation so the web UI could come in handy. Running this locally is a must for confidentiality reasons. I’d like to get rid of Textract as well, but unfortunately I haven’t found a solution that’s even close. Tesseract in particular was very disappointing.
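For a rough idea of the shape of that pipeline, here is a hedged sketch (not the actual setup described above; the Chroma store, model names and chunk sizes are illustrative assumptions):

```python
# Hedged sketch of a local RAG loop: chunk text, embed via Ollama, retrieve, then ask the LLM.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama

docs_text = open("report.txt").read()  # OCR'd / extracted documents go here
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_text(docs_text)

db = Chroma.from_texts(chunks, embedding=OllamaEmbeddings(model="nomic-embed-text"))
llm = Ollama(model="mixtral")

question = "What failure modes are reported across these documents?"
context = "\n\n".join(d.page_content for d in db.similarity_search(question, k=4))
print(llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```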
In my opinion, pre-built binaries and an easy-to-use front-end are things that should exist and are valid as a separate project unto themselves (see, e.g., HandBrake vs ffmpeg).
Using the name of the authors or the project you're building on can also read like an endorsement, which is not _necessarily_ desirable for the original authors (it can lead to ollama bugs being reported against llama.cpp instead of to the ollama devs and other forms of support request toil). Consider the third clause of BSD 3-Clause for an example used in other projects (although llama.cpp is licensed under MIT).
> Granted that it makes the experimentation process super easy
That's the answer to your question. It may have less space than a Zune, but the average person doesn't care about technically superior alternatives that are much harder to use.
The first 3 steps GP provided are literally just the steps for installation. The "value" you mentioned is just a packaged installer (or, in the case of Linux, apparently a `curl | sh` -- and I'd much prefer the git clone version).
On multiple occasions I've been modifying llama.cpp code directly and recompiling for my own purposes. If you're using ollama on the command line, I'd say having the option to easily do that is much more useful than saving a couple commands upon installation.
It should be fairly obvious that one can find alternative models and use them in the above command too.
Look, I’m not arguing that a prebuilt binary that handles model downloading has no value over a source build and manually pulling down gguf files. I just want to dispel some of the mystery.
Local LLM execution doesn’t require some mysterious voodoo that can only be done by installing and running a server runtime. It’s just something you can do by running code that loads a model file into memory and feeds tokens to it.
More programmers should be looking at llama.cpp language bindings than at Ollama’s implementation of the openAI api.
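For comparison, here's roughly what the bindings route looks like with `llama-cpp-python` (a sketch; the gguf path and model are placeholders):

```python
# Sketch: direct llama.cpp bindings instead of an HTTP server.
from llama_cpp import Llama

llm = Llama(
    model_path="./mistral-7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload everything that fits to the GPU
    n_ctx=4096,
)
out = llm("Q: Name three uses of a local LLM. A:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```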
There are 5 commands in that README two comments up; 4 can reasonably fail (I'll give cd high marks for reliability). `make` especially is a minefield and usually involves a half-hour of searching the internet and figuring out which dependencies are a problem today. And that is all assuming someone is comfortable with compiled languages. I'd hazard most devs these days are from JS land and don't know how to debug make.
Finding the correct model weights is also a challenge in my experience, there are a lot of alternatives and it is often difficult to figure out what the differences are and whether they matter.
The README is clear that I'm probably about to lose an hour debugging if I follow it. It might be one of those rare cases where it works first time but that is the exception not the rule.
Just be aware that there’s a lot of expressive difference between building on top of an HTTP API vs on top of a direct interface to the token sampler and model state.
Python’s as good a choice as any for the application layer. You’re either going to be using PyTorch or llama-cpp-python to get the CUDA stuff working - both rely on native compiled C/C++ code to access GPUs and manage memory at the scale needed for LLMs. I’m not actually up to speed on the current state of the game there but my understanding is that llama.cpp’s less generic approach has allowed it to focus on specifically optimizing performance of llama-style LLMs.
I've seen more of the model fiddling, like logit restrictions and layer dropping, implemented in Python, which is why I ask.
Most of AI has centralized around Python, and I see more of my code moving that way; for example, I'm using LlamaIndex as my primary interface now, which supports ollama and many other model loaders / APIs.
There is no "next", there is a whole world of people running LLMs locally on their computer and they are far more likely to switch between models on a whim every few days.
The average user isn't going to compile llama.cpp. They will either download a fully integrated application that contains llama.cpp and can read gguf files directly, like kobold.cpp, or they will use an arbitrary front end like Silly Tavern, which needs to connect to an inference server via an API; ollama is one of the easier inference servers to install and use.
I was trying to get AMD GPU support going in llama.cpp a couple weeks ago and just gave up after a while. `rocminfo` shows that I have a GPU and, presumably, ROCm installed, but there were build problems I didn't feel like sorting out just to play with an LLM for a bit.
Even for just running a model locally, Ollama provided a much simpler "one click install" earlier than most tools. That in itself is worth the support.
Koboldcpp is also very, very good: plug and play, a very complete web UI, a nice little API with SSE text streaming, Vulkan accelerated, and it has an AMD fork...
Interestingly, Ollama is not popular at all in the "localllama" community (which also extends to related discords and repos).
And I think that's because of capabilities... Ollama is somewhat restrictive compared to other frontends. I have a litany of reasons I personally wouldn't run it over exui or koboldcpp, both for performance and output quality.
This is a necessity of being stable and one-click though.
For me, all the projects which enable running & fine-tuning LLMs locally, like llama.cpp, ollama, open-webui, unsloth etc., play a very important part in democratizing AI.
> what do you do after the initial experimentation? you need to deploy these models eventually to production
I built GaitAnalyzer[1] to analyze my gait on my laptop; I had deployed it briefly in production when I had enough credits to foot the AWS GPU bills. Ollama made it very simple to deploy the application; anyone who has used docker before can now run GaitAnalyzer on their computer.
I mean, it takes something difficult like an LLM and makes it easy to run. It's bound to get attention. If you've tried to get other models, like BERT-based models, to run, you'll realize just how big the usability gains of running ollama are compared to anything else in the space.
If the question you're asking is why so many folks are focused on experimentation instead of productionizing these models, then I see where you're coming from. There's the question of how much LLMs are actually being used in prod scenarios right now as opposed to just excited people chucking things at them; that maybe LLMs are more just fun playthings than tools for production. But in my experience as HN has gotten bigger, the number of posters talking about productionizing anything has really gone down. I suspect the userbase has become more broadly "interested in software" rather than "ships production facing code" and the enthusiasm in these comments reflects those interests.
FWIW we use some LLMs in production and we do not use ollama at all. Our prod story is very different than what folks are talking about here and I'd love to have a thread that focuses more on language model prod deployments.
Well, you would be one of the few hundred people on the planet doing that. With local LLMs we're just trying to create a way for everyone else to use AI that doesn't require sharing all their data with the model providers. The first thing everyone asks for, of course, is how to turn the open source local LLMs into their own online service.
Ollama's purpose and usefulness is clear. I don't think anyone is disputing that nor the large usability gains ollama has driven. At least I'm not.
As far as being one of the few hundred on the planet, well yeah that's why I'm on HN. There's tons of publications and subreddits and fora for generic tech conversation. I come here because I want to talk about the unknowns.
> I come here because I want to talk about the unknowns.
Your knowns are unknowns to some people, and vice versa. This is a great strength of HN; on a whole lot of subjects you'll find people ranging from enthusiast to expert. There are probably subreddits or discord servers tailored to narrow niches, and that's cool, but HN is not that. They are complementary, if anything. In contrast, HN is much more interesting and has a much better S/N ratio than generic tech subreddits; it's not even comparable.
I've been using this site since 2008. This is my second account, from 2009. HN very much used to be a tech niche site. I realize that for newer users the appeal is like r/programming or r/technology but with a more text-oriented interface or higher SNR or whatever, but this is a shift in audience, and there are folks like me on this site who still want to use it for niche content.
There are still threads where people do discuss gory details, even if the topics aren't technical. A lot of the mapping articles on the site bring out folks with deep knowledge about mapping stacks. Alternate energy threads do it too. It can be like that for LLMs also, but the user base has to want this site to be more than just Yet Another Tech News Aggregator thread.
For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly. The LLM threads bring me here because experts like jart chime in.
> I've been using this site since 2008. This is my second account from 2009.
I did not really like HN back in the day because it felt too startup-y, but maybe I got a wrong impression. I much preferred Ars Technica and their forum (now, Ars is much less compelling).
> For me as of late I've come to realize that the current audience wants YATNE more than they want deep discussion here and so I modulate my time here accordingly.
I think it depends on the stories. Different subjects have different demographics, and I have almost completely stopped reading physics stuff because it is way too Reddit-like (and full of confident people asserting embarrassingly wrong facts). I can see how you could feel about fields closer to your interests being dumbed down by over-confident non-specialists.
There are still good, highly technical discussions, but it is true that the home page is a bit limited and inefficient to find them.
Few hundred on the planet? Are you kidding me? We're asking enterprises to run LLMs on-premise (I'm intentionally discounting the cloud scenario, where the traffic rates are much higher). That's way more than a hundred, and sorry to break it to you, but Ollama is just not going to cut it.
No need to be angry about this. Tech folks should be discussing this collectively and collaboratively. There's space for everything from local models running on smartphones all the way up to OpenAI style industrialized models. Back when social networks were first coming out, I used to read lots of comments about deploying and running distributed systems. I remember reading early incident reports about hotspotting and consistent hashing and TTL problems and such. We need to foster more of that kind of conversation for LMs. Sadly right now Xitter seems to be the best place for that.
Not angry. Having a discussion :-). It just amazes me how the HN crowd is more than happy to just try out a model on their machine and call it a day, without seeing the real picture ahead. Let's ignore perf concerns for a moment. Let's say I want to run it on a shared server in the enterprise network so that any application can make use of it. Each application might want to use a model of its choosing. Ollama will unload/load/unload models as each new request arrives. Not sure if folks here realize this :-)
I am 100% uninterested in your production deployment of rent seeking behavior for tools and models I can run myself. Ollama empowers me to do more of that easier. That’s why it’s popular.
It's nice for personal use, which is what I think it was built for, and it has some nice frontend options too. The tooling around it is nice, and there are projects building in RAG etc. I don't think people are intending to deploy services through these tools.
FWIW Ollama has no concurrency support even though llama.cpp's server component (the thing that Ollama actually uses) supports it. Besides, you can't have more than 1 model running. Unloading and loading models is not free. Again, there's a lot more and really much of the real optimization work is not in Ollama; it's in llama.cpp which is completely ignored in this equation.
I'm pretty sure their primary focus right now is to gain as much mindshare as possible and they seem to be doing a great job of it. If you look at the following GitHub metrics:
The number of people engaging with ollama is twice that of llama.cpp. And there hasn't been a dip in people engaging with Ollama in the past 6 months. However, what I do find interesting with regards to these two projects is the number of merged pull requests. If you click on the "Groups" tab and look at "Hooray", you can see llama.cpp had 72 contributors with one or more merged pull requests vs 25 for Ollama.
For Ollama, people are certainly more interested in commenting and raising issues. Compare this to llama.cpp, where the number of people contributing code changes is double that of Ollama.
I know llama.cpp is VC funded, and if they don't focus on making llama.cpp as easy to use as Ollama, they may find themselves doing all the hard stuff with Ollama reaping all the benefits.
Of course, you can support concurrent requests. But Ollama doesn't support it and it's not meant for this purpose and that's perfectly ok. That's not the point though. For fast/perf scenarios, you're better off with vllm.
This is not CUDA's moat. That is on the R&D/training side.
Inference side is partly about performance, but mostly about cost per token.
And given that there has been a ton of standardization around LLaMA architectures, AMD/ROCm can target this much more easily, and still take a nice chunk of the inference market for non-SOTA models.
Hypotheticals don't matter. The average user won't have the most expensive GPU and when it comes to VRAM AMD is half as expensive so they lead in this area.
Not sure why you're downvoted, but as far as I've heard AMD cards can't beat 4090 - yet.
Still, I think AMD will catch or overtake NVidia in hardware soon, but software is a bigger problem. Hopefully the opensource strategy will pay off for them.
An RTX 4090 is about twice the price of and 50%-ish faster than AMD's most expensive consumer card, so I'm not sure anyone really expects the latter to ever surpass a 4090.
A 7900 XTX beating a RTX 4080 at inference is probably a more realistic goal though I'm not sure how they compare right now.
The 4080 is $1k for 16gb of VRAM, and the 7900 is $1k for 24gb of VRAM. Unless you're constantly hammering it with requests, the extra speed you may get with CUDA on a 4080 is basically irrelevant when you can run much better models at a reasonable speed.
Please check out https://github.com/geniusrise - a tool for running LLMs and other stuff; it behaves like docker compose and works with whatever is supported by the underlying engines.
Feels like all of this local LLM stuff is definitely pushing people in the direction of getting new hardware, since nothing like RX 570/580 or other older cards sees support.
On one hand, the hardware nowadays is better and more powerful, but on the other, the initial version of CUDA came out in 2007 and ROCm in 2016. You'd think that compute on GPUs wouldn't require the latest cards.
The Ollama backend llama.cpp definitely supports those older cards with the OpenCL and Vulkan backends, though performance is worse than ROCm or CUDA. In their Vulkan thread for instance I see people getting it working with Polaris and even Hawaii cards.
No new hardware needed. I was shocked that Mixtral runs well on my laptop, which has a so-so mobile GPU. Mixtral isn't hugely fast, but definitely good enough!
Thanks, wow, amazing that you can already run a small model with so little RAM. I need to buy a new laptop; I guess more than 16 GB on a MacBook isn't really needed.
I've run LLMs and some of the various image models on my M1 Studio 32GB without issue. Not as fast as my old 3080 card, but considering the Mac all in has about a 5th the power draw, it's a lot closer than I expected. I'm not sure of the exact details but there is clearly some secret sauce that allows it to leverage the onboard NN hardware.
Super easy. You can just head down to https://lmstudio.ai and pick up an app that lets you play around. It's not particularly advanced, but it works pretty well.
It's mostly optimized for M-series silicon, but it also technically works on Windows, and isn't too difficult to trick into working on Linux either.
Looks super cool, though it seems to be missing a good chunk of features, like the ability to change the prompt format. (Just installed it myself to check out all the options.) All the other missing stuff I can see though is stuff that LM Studio doesn't have either (such as a notebook mode). If it has a good chat mode then that's good enough for most!
llama.cpp added first-class support for the RX 580 by implementing the Vulkan backend. There are some issues in older kernel amdgpu code where an LLM process's VRAM is never reloaded if it gets kicked out to GTT (in 5.x kernels), but overall it's much faster than the CLBlast OpenCL implementation.
The compatibility matrix is quite complex for both AMD and NVIDIA graphics cards, and completely agree: there is a lot of work to do, but the hope is to gracefully fall back to older cards.. they still speed up inference quite a bit when they do work!
Just downloaded this and gave it a go. I have no experience with running any local models, but this just worked out of the box on my 7600 on Ubuntu 22. This is fantastic.
Yep, and it deserves the credit! He who writes the cuda kernel (or translates it) controls the spice.
I had wrapped this and had it working in Ollama months ago as well: https://github.com/ollama/ollama/pull/814. I don't use Ollama anymore, but I really like the way they handle device memory allocation dynamically, I think they were the first to do this well.
Ollama does a nice job of looking at how much VRAM the card has and tuning the number of gpu layers offloaded. Before that, I mainly just had to guess. It's still a heuristic, but I thought that was neat.
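In spirit it's something like "free VRAM minus overhead, divided by per-layer size" (a hypothetical sketch of the idea, not Ollama's actual implementation):

```python
# Hypothetical illustration of VRAM-based layer offload, not Ollama's real code.
def estimate_gpu_layers(free_vram_bytes: int, n_layers: int, model_bytes: int,
                        overhead_bytes: int = 1 << 30) -> int:
    """Return how many transformer layers to offload; the rest stay on the CPU."""
    bytes_per_layer = model_bytes / n_layers
    usable = max(free_vram_bytes - overhead_bytes, 0)  # reserve ~1 GiB for KV cache/scratch
    return min(n_layers, int(usable // bytes_per_layer))

# e.g. a ~4.4 GB 7B Q4 model with 32 layers on an 8 GB card -> offload all 32 layers
print(estimate_gpu_layers(8 << 30, 32, 4_400_000_000))
```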
I'm mainly just using llama.cpp as a native library now, mainly for the direct access to more of llama's data structures, and because I have a sort of unique sampler setup.
I’m curious as to how they pulled this off. OpenCL isn't that common in the wild relative to CUDA. Hopefully it can become robust and widespread soon enough. I personally succumbed to the pressure and spent a relative fortune on a 4090, but wish I had some choice in the matter.
Another giveaway that it's ROCm is that it doesn't support the 5700 series...
I'm really salty because I "upgraded" to a 5700XT from a Nvidia GTX 1070 and can't do AI on the GPU anymore, purely because the software is unsupported.
But, as a dev, I suppose I should feel some empathy that there's probably some really difficult problem causing 5700XT to be unsupported by ROCm.
I tried it recently and couldn't figure out why it existed. It's just a very feature limited app that doesn't require you to know anything or be able to read a model card to "do AI".
It's just because it's convenient. I wrote a rich text editor front end for llama.cpp and I originally wrote a quick go web server with streaming using the go bindings, but now I just use ollama because it's just simpler and the workflow for pulling down models with their registry and packaging new ones in containers is simpler. Also most people who want to play around with local models aren't developers at all.
I'm not sure why you are assuming that ollama users are developers when there are at least 30 different applications that have direct API integration with ollama.
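The API surface is small, which probably helps explain all those integrations; a minimal call to the documented `/api/generate` endpoint looks roughly like this (the model name is just an example):

```python
# Sketch: querying a locally running ollama server (default port 11434).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Why is the sky blue?", "stream": False},
    timeout=300,
)
print(resp.json()["response"])
```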
Eh, I've been building native code for decades and hit quite a few roadblocks trying to get llama.cpp building with CUDA support on my Ubuntu box. Library version issues and such. Ended up down a rabbit hole related to codenames for the various Nvidia architectures... It's a project on hold for now.
Weirdly, the Python bindings built without issue with pip.
Edited it out of my original comment because I didn't want to seem ranty/angry/like I have some personal vendetta, as opposed to just being extremely puzzled, but it legit took me months to realize it wasn't a GUI because of how it's discussed on HN, i.e. as key to democratizing, as a large, unique entity, etc.
Hadn't thought about it recently. After seeing it again here, and being gobsmacked by the # of genuine, earnest, comments assuming there's extensive independent development of large pieces going on in it, I'm going with:
- "The puzzled feeling you have is simply because llama.cpp is a challenge on the best of days, you need to know a lot to get to fully accelerated on ye average MacBook. and technical users don't want a GUI for an LLM, they want a way to call an API, so that's why there isn't content extalling the virtues of GPT4All*. So TL;DR you're old and have been on computer too much :P"
but I legit don't know and still can't figure it out.
* picked them because they're the most recent example of a genuinely democratizing tool that goes far beyond llama.cpp and also makes large contributions back to llama.cpp, ex. GPT4All landed 1 of the 2 vulkan backends
OpenCL is as dead as OpenGL, and the inference implementations that exist are very unperformant. The only real options are CUDA, ROCm, Vulkan and CPU. And Vulkan is a proper pain too; it takes forever to build compute shaders and has to do so for each model. It only makes sense on Intel Arc since there's nothing else there.
You can use OpenCL just fine on Nvidia, but CUDA is just a superior compute programming model overall (both in features and design.) Pretty much every vendor offers something superior to OpenCL (HIP, OneAPI, etc), because it simply isn't very nice to use.
I suppose that's about right. The implementors are busy building on a path to profit and much less concerned about any sort-of lock-in or open standards--that comes much later in the cycle.
OpenCL is fine on Nvidia Hardware. Of course it's a second class citizen next to CUDA, but then again everything is a second class citizen on AMD hardware.
Apple killed off OpenCL for their platforms when they created Metal which was disappointing. Sounds like ROCm will keep it alive but the fragmentation sucks. Gotta support CUDA, OpenCL, and Metal now to be cross-platform.
OpenCL is a Khronos open spec for GPU compute, and what you’d use on Apple platforms before Metal compute shaders and CoreML were released. If you wanted to run early ML models on Apple hardware, it was an option. There was an OpenCL backend for torch, for example.
No idea. My best guess is their background is in graphics and games rather than machine learning. When CUDA is all you've ever known, you try just a little harder to find a way to keep using it elsewhere.
What's not reliable about it? On Linux hipcc is about as easy to use as gcc. On Windows it's a little janky because hipcc is a Perl script and there's no Perl interpreter, I'll admit. I'm otherwise happy with it though. It'd be nice if they had a shell script installer like NVIDIA's, so I could use an OS that isn't a 2-year-old Ubuntu. I own 2 XTX cards, but I'm actually switching back to NVIDIA on my main workstation for that reason alone. GPUs shouldn't be choosing winners in the OS world. The lack of a profiler is also a source of frustration. I think the smart thing to do is to develop on NVIDIA and then distribute to AMD. I hope things change, though, and I plan to continue doing everything I can to support AMD, since I badly want to see more balance in this space.
Last time I used AMD GPUs for GPGPU all it took was running hashcat to make the desktop rendering unstable. I'm sure leaving it run overnight would've gotten me a system crash.
That's always happened with NVIDIA on Linux too, because Linux is an operating system that actually gives you the resources you ask for. Consider using a separate video card that's dedicated to your video needs. Otherwise you should use MacOS or Windows. It's 10x slower at building code. But I can fork bomb it while training a model and Netflix won't skip a frame. Yes I've actually done this.
Given the price of top line NVidia cards, if they can be had at all, there's got to be a lot of effort going on behind the scenes to improve AMD support in various places.
Does anyone know how the AMD consumer GPU support on Linux has been implemented? Must use something else than ROCm I assume? Because ROCm only supports the 7900 XTX on Linux[1], while on Windows[2] support is from RX 6600 and upwards.
The newest release, 6.0.2, supports a number of other cards[1] and in general people are able to get a lot more cards to work than are officially supported. My 7900 XT worked on 6.0.0 for instance.
How hard would it be for AMD just to document the levels of support of different cards the way NVIDIA does with their "compute capability" numbers ?!
I'm not sure what is worse from AMD - the ML software support they provide for their cards, or the utterly crap documentation.
How about one page documenting AMD's software stack compared to NVIDIA, one page documenting what ML frameworks support AMD cards, and another documenting "compute capability" type numbers to define the capabilities of different cards.
Can anyone (maybe ollama contributors) explain to me the relationship between llama.cpp and ollama?
I always thought that ollama basically just was a wrapper (i.e. not much changes to inference code, and only built on top) around llama.cpp, but this makes it seem like it is more than that?
An OS was possibly much more complicated to write at that time than CUDA is to write today. And the competition is too strong. Its run might be even briefer than Sun's.
I wish AMD did well on the Stable Diffusion front, because AMD is never greedy on VRAM. The 4060 Ti 16GB (the minimum required for Stable Diffusion in 2024) starts at $450.
AMD with ROCm is decent on Linux but pretty bad on Windows.
I run A1111, ComfyUI and kohya-ss on an AMD (6900XT which has 16GB, the minimum required for Stable Diffusion in 2024 ;)), though on Linux. Is it a Windows specific Issue for you?
Edit to add: Though apparently I still don't run ollama on AMD since it seems to disagree with my setup.
Or rather Nvidia is purposefully restricting VRAM to avoid gaming cards canibalizing their supremely profitable professional/server cards. AMD has no relevant server cards, so they have no reason to hold back on VRAM in consumer cards
Nvidia released consumer RTX 3090 with 24GB VRAM in Sep 2020, AMDs flagship release in that same month was 6900 XT with 16GB VRAM. Who is being restrictive here exactly?
Exactly. My friend was telling me that I was making a mistake for getting a 7900 XTX to run language models, when the fact of the matter is the cheapest NVIDIA card with 24 GB of VRAM is over 50% more expensive than the 7900 XTX. Running a high quality model at like 80 tps is way more important to me than running a way lower quality model at like 120 tps.
I cannot parse this. The Radeon RX 7900 XTX also has 24GB of vram, so how does it help you run higher quality models? I would understand if it had more ram.
Only the RX 7900 XTX has 24 GB of VRAM at its price point. If I went with an NVIDIA card, I would either have to spend over 50% more on the card, or use much worse models to fit on their 16 GB cards.
```
time=2024-03-16T00:11:07.993+01:00 level=WARN source=amd_linux.go:50 msg="ollama recommends running the https://www.amd.com/en/support/linux-drivers: amdgpu version file missing: /sys/module/amdgpu/version stat /sys/module/amdgpu/version: no such file or directory"
time=2024-03-16T00:11:07.993+01:00 level=INFO source=amd_linux.go:85 msg="detected amdgpu versions [gfx1031]"
time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:339 msg="amdgpu detected, but no compatible rocm library found. Either install rocm v6, or follow manual install instructions at https://github.com/ollama/ollama/blob/main/docs/linux.md#man..."
time=2024-03-16T00:11:07.996+01:00 level=WARN source=amd_linux.go:96 msg="unable to verify rocm library, will use cpu: no suitable rocm found, falling back to CPU"
time=2024-03-16T00:11:07.996+01:00 level=INFO source=routes.go:1105 msg="no GPU detected"
```
Need to check how to install rocm on arch again... have done it once, a few moons back, but alas...
Ah, this is probably from missing ROCm libraries. The dynamic libraries are available as one of the release assets (warning: it's about 4GB expanded) https://github.com/ollama/ollama/releases/tag/v0.1.29 – dropping them in the same directory as the `ollama` binary should work.
Wow, that's a huge feature. Thank you, guys. By the way, does anyone have a preferred case where they can put 4 AMD 7900XTX? There's a lot of motherboards and CPUs that support 128 lanes. It's the physical arrangement that I have trouble with.
You don't need 128 lanes. 8x PCIe 3 is more than enough, so for 4 cards that's 32. Most CPUs have about 40 lanes. If you are not doing much, that would be more than sufficient. Buy a PCIe riser: go to Amazon and search for a 16x to 16x PCIe riser. They go for about $25-$30, often about 20-30cm long. If you want a really long one, you can get a 60cm one from China for about the same price; you just have to wait 3 weeks. That's what I did. Stuffing all those in a case is often difficult, so you have to go open rig. Either have the cables running out of your computer and figure out a way to support the cards while keeping them cool, or just buy a small $20-$30 open rig frame.
Most CPUs have about 20 lanes (plus a link to the chipset).
On the one hand, they will be gen 4 or 5, so they're the equivalent of 40-80 gen 3 lanes.
On the other hand, you can only split them up if you have a motherboard that supports bifurcation. If you buy the wrong model, you're stuck dedicating the equivalent of 64 gen 3 lanes to a single card.
Edit: Actually, looking into it further, current Intel desktop processors will only run their lanes as 16(+4) or 8+8(+4). You can kind of make 4 cards work by using chipset-fed slots, but that sucks. You could also get a PCIe switch but those are very expensive. AMD will do 4+4+4+4(+4) on the right boards.
Ah ha, that's the part I was curious about. I was wondering if I could keep everything cool with an open rig. I'm waiting for the stuff to arrive: risers, board, CPU, GPUs. And I've been putting it off because I wasn't sure about the case. All right then, open rig frame. Thank you!
Crypto mining didn't require significant bandwidth to the card. Mining-oriented motherboards typically only provisioned a single lane of PCIe to each card, and often used anemic host CPUs (like Celeron embedded parts).
Does LLM inference require significant bandwidth to the card? You have to get the model into VRAM, but that's a fixed startup cost, not a per-output-token cost.
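For a rough sense of scale: a ~4 GB quantized 7B model pushed over a single PCIe 3.0 lane (~1 GB/s) takes on the order of four seconds to load, and after that only prompt and sampled tokens cross the bus, so per-token bandwidth needs are tiny as long as the model stays resident in VRAM.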
Those are Navi 22/23/24 GPUs while the RX 6800+ GPUs are Navi 21. They have different ISAs... however, the ISAs are identical in all but name.
LLVM has recently introduced a unified ISA for all RDNA 2 GPUs (gfx10.3-generic), so the need for the environment variable workaround mentioned in the other comment should eventually disappear.
Ollama runs really, really slow on my MBP for Mistral - as in just a few tokens a second and it takes a long while before it starts giving a result. Anyone else run into this?
I've seen that it seems to be related to the amount of system memory available when ollama is started (??); however, LM Studio does not have such issues.
There's a thing somewhat conspicuous in its absence - why isn't llama.cpp more directly credited and thanked for providing the base technology powering this tool?
All the other cool "run local" software seems to have the appropriate level of credit. You can find llama.cpp references in the code, being set up in a kind of "as is" fashion such that it might be OK as far as MIT licensing goes, but it seems kind of petty to have no shout out or thank you anywhere in the repository or blog or ollama website.
ollama has made a lot of nice contributions of their own. It's a good look to give a hat tip to the great work llama.cpp is also doing, but they're strictly speaking not required to do that in their advertising any more than llama.cpp is required to give credit to Google Brain, and I think that's because llama.cpp has pulled off tricks in the execution that Brain never could have accomplished, just as ollama has had great success focusing on things that wouldn't make sense for llama.cpp. Besides everyone who wants to know what's up can read the source code, research papers, etc. then make their own judgements about who's who. It's all in the open.
Newton only gave credits to "Gods". He said he stands on shoulders of Gods or something like that. But he never mentioned which Gods in particular, did he?
Fair. I didn't want to assume the worst, that it was just rhetorical slight of hand where "providing release packaging around an open source? you should credit it, at some point, somewhere." is implied as ridiculous, like asking llama.cpp to credit Google Brain. (Presumably, the implication is, for transformers / the Attention is All You Need paper)
If you're going to accuse me of rhetorical sleight of hand, you could start by at least spelling it correctly. This whole code stealing shtick is the kind of thing I'd expect from teenagers on 4chan not from someone who's been professionally trained like you. Many open source licenses like BSD-4 and X11 are actually written to prohibit people from "giving credit" in advertising in the manner you're expecting.
I'm 35, got my start in FOSS by working on Handbrake at 17 when ffmpeg was added.
There, I learned that you're supposed to credit projects you depend on, especially ones you depend on heavily.
I don't know why you keep finding ways to dismiss this simple fact. (really? spelling? on Saturday morning!?!? :D).
Especially with a strong record of open source contributions yourself.
Especially when your project is a classic example of A) building around llama.cpp and crediting it. I literally was thinking about llamafile when I wrote my original comment, before I realized who I was replying to.
I'm really trying to find a communication bridge here because I'm deeply curious, and I'd appreciate you doing the same if I'm lucky enough to get a reply from your august personage again. (seriously! no sarcasm!) My latest guesses:
- you saw this post far after the early tide of, ex., exaggerating for clarity, "anyone got the leak on the deets on how these wizards did this?!?!?! CUDA going down!"
- You're unaware Ollama _does not mention or credit llama.cpp at all_. Not once. Never. Google search query I used to verify my presumption is `site:ollama.com "llama.cpp"`. You will find that it is only mentioned in READMEs of repos of other peoples models, mentioning how they quantized.
- You're unaware this is an ongoing situation. Probably the 3rd thread I've seen in 3 months, with decreasing #s of people treating it like an independent commercial startup making independent breakthroughs, and increasing #s of people being like "...why are you still doing this..."
For those unfamiliar, this is how jart's llamafile project credits llama.cpp, they certainly don't avoid it altogether, and they certainly don't seem to think its unnecessary. (source: https://github.com/Mozilla-Ocho/llamafile)
- 2nd sentence in README: "Our goal is to make open LLMs much more accessible to both developers and end users. We're doing that by combining ___llama.cpp___ with Cosmopolitan Libc into one framework"
- 21 mentions in README altogether.
- Under "How llamafile works", 3 mentions crediting llama.cpp in 5 steps.
Look, there's a very simple way we can prove or disprove if ollama is doing something wrong. The MIT license (which llama.cpp uses) requires that their copyright notice accompany their source code. Here we can see that ollama is using llama.cpp (there's no secret about that) but they don't have any llama.cpp source code in their repo apparently https://github.com/search?q=repo%3Aollama%2Follama+Georgi+Ge... since they appear to be building it as a static library separately as some kind of workflow in their CI system.
But here's the twist: the MIT license requires that the copyright notice be distributed with the binary forms as well. That does not mean advertising. They're not required to mention it in their website or in their communiqués. The bare minimum requirement is that the copyright notice be present in their release artifacts.
llamafile solves this by embedding the copyright notice inside your llamafiles.
So let me install the latest Ollama on my Windows computer and see if they're doing this too. https://justine.lol/tmp/ollama-license-violation.png It would seem the answer is no. So yes, ollama appears to be violating the llama.cpp license, and probably the licenses of many other projects too. But not for the reasons we were discussing earlier.
Oops. They're violating the license on Linux too. It's also a little creepy that it used sudo on its own. So I've filed an issue here: https://github.com/ollama/ollama/issues/3185
lol, although you made me think--the history of computers has involved patching layer after layer of sediment on top of each other until the how things work 10 layers deep is forgotten. Imagine when local LLMs are five layers deep. It'll be like having machine spirits. You know how you have to yell to get the LLM to do what you want sometimes? That's what the future version of sudo will be like.
Who are these heathens who are going to ignore the basic language design work done by Kernighan & Ritchie? And whoever invented ++. Stroustrup didn't invent the C, the ++ or the notion of ++ing things in general. His contribution here is clearly quite small.
I thought the original argument for focusing on ollama alone was quite good. Once you get into dependencies it is turtles all the way down.
[0] https://github.com/ollama/ollama/issues/2870
Edit: I did notice one change: the starcoder2[1] model now works. Before, it would crash[2].
[1] https://ollama.com/library/starcoder2
[2] https://github.com/ollama/ollama/issues/2953