dirtyhand's comments | Hacker News

I was considering getting an RTX 5090 to run inference on some LLMs, but now I'm wondering if it's worth paying an extra $2K for this option instead.


If you want to run small models fast, get the 5090. If you want to run large models slow, get the Spark. If you want to run small models slow, get a used MI50. If you want to run large models fast, get a lot more money.


You might be able to do "large models slow" better than the Spark with a 5090 and CPU offload, so long as you stick with MoE architectures. With the KV cache and shared parts of the model on the GPU and all of the experts on the CPU, it can work pretty well. I'm able to run ~400GB models at 10 tps with some A4000s and a bunch of RAM. That's on a Xeon W system with poor practical memory bandwidth (~190GB/s); you can do better with EPYC.
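
Back-of-the-envelope for why this works (a sketch with made-up but plausible numbers; the key point is that an MoE model only reads the routed experts per token):

  # Illustrative decode-rate estimate for MoE CPU offload.
  # All numbers are assumptions, not measurements.
  model_size_gb       = 400   # full quantized model in system RAM
  active_gb_per_token = 20    # weights actually touched per token (MoE routing)
  gpu_resident_gb     = 2     # shared/attention weights kept in VRAM
  ram_bw_gbps         = 190   # practical system memory bandwidth

  # Each decoded token streams the CPU-resident active weights once:
  streamed = active_gb_per_token - gpu_resident_gb
  print(f"~{ram_bw_gbps / streamed:.0f} tokens/s ceiling")   # ~11 tok/s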


The RTX 5090 is about as good as it gets for home use. Its inference speeds are extremely fast.

The limiting factor is going to be the VRAM on the 5090, and NVIDIA intentionally makes breaking the 32GB barrier extremely painful: they want companies to buy their $20,000 GPUs to run inference for larger models.
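
A quick sizing sketch shows why 32GB is the wall (weights-only and illustrative; KV cache and runtime overhead are extra, so these are optimistic):

  # Rough weights-only model sizing vs. the 32GB VRAM ceiling.
  vram_gb = 32
  for params_b in (8, 14, 32, 70):
      for quant, bytes_per_param in (("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)):
          size_gb = params_b * bytes_per_param
          verdict = "fits" if size_gb <= vram_gb else "too big"
          print(f"{params_b}B @ {quant}: {size_gb:5.1f}GB  {verdict}")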


RTX 5090 for running smaller models.

Then the RTX Pro 6000 for running somewhat larger models (96GB VRAM, but only ~15-20% more perf than the 5090).

Some suggest Apple Silicon for running larger models on a budget because of the unified memory, but the performance won't compare.


No. These are practically useless for AI.

Their prompt processing speeds are absolutely abysmal: if you're trying to tinker from time to time, a GPU like a 5090 or renting GPUs is a much better option.

If you're just trying to prep for impending mainstream AI applications, few will be targeting this form factor: it's both too strong compared to mainstream hardware and way too weak compared to dedicated AI-focused accelerators.

-

I'll admit I'm taking a less nuanced take than some would prefer, but I'm also trying to be direct: this is never going to be a better option than a 5090.


  Their prompt processing speeds are absolutely abysmal
They are not. This is Blackwell with Tensor cores. Bandwidth is the problem here.


They're abysmal compared to anything dedicated at any reasonable batch size, because of both bandwidth and compute; I'm not sure why you're wording this like it disagrees with what I said.

I've run inference workloads on a GH200, which is an entire H100 attached to an ARM processor, and the moment offloading is involved, speeds tank to Mac Mini levels; the Mac Mini is similarly mostly a toy when it comes to AI.


Again, prompt processing isn't the major problem here. It's bandwidth. 256GB/s of bandwidth (maybe ~210GB/s in the real world) limits tokens per second well before prompt processing does.
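
To put a number on it (illustrative figures, not benchmarks):

  # Why decode is bandwidth-bound here: every generated token streams
  # the active weights through memory once.
  bw_gbps  = 210    # realistic sustained bandwidth on the Spark
  model_gb = 60     # e.g. a mid-size dense model after quantization
  print(f"decode ceiling ~{bw_gbps / model_gb:.1f} tokens/s")   # ~3.5
  # Prefill batches the whole prompt into large matmuls, so tensor-core
  # compute keeps it fast; it never hits this per-token memory wall.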

Not entirely sure how your ARM statement matters here. This is unified memory.


What model are you running?

I suspect you're running a very large model like DeepSeek in coherent memory?

Keep in mind that this little DGX only has 128GB, which means it can run fairly small models such as Qwen3 Coder, where prompt processing is not an issue.

I'm not doubting your experience with the GH200, but it doesn't seem relevant here, because bandwidth is the Spark's bottleneck well before prompt processing.


I like the cut of your jib and your experience matches mine, but without real numbers this is all just piss in the wind (as far as online discussions go).


You're right; it's unfortunate I didn't keep the benchmarks around. I benchmark a lot of configurations and providers for my site and have a script I typically run that produces graphs for various batch sizes (https://ibb.co/0RZ78hMc).

The performance with offloading was just so bad I didn't even bother proceeding to the benchmark (without offloading you get typical H100 speeds).
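
For reference, the sweep is nothing fancy; a minimal version looks like this, assuming an OpenAI-compatible endpoint (the URL, model id, and the usage field in the response are assumptions that depend on your server):

  import concurrent.futures, time
  import requests

  URL = "http://localhost:8000/v1/completions"   # hypothetical endpoint

  def one_request(_):
      r = requests.post(URL, json={
          "model": "my-model",                    # placeholder model id
          "prompt": "Explain KV caching in one paragraph.",
          "max_tokens": 128,
      }, timeout=600)
      return r.json()["usage"]["completion_tokens"]

  # Fire `batch` concurrent requests and measure aggregate throughput.
  for batch in (1, 2, 4, 8, 16):
      start = time.time()
      with concurrent.futures.ThreadPoolExecutor(max_workers=batch) as ex:
          total = sum(ex.map(one_request, range(batch)))
      print(f"batch={batch:2d}: {total / (time.time() - start):6.1f} tok/s")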


I am happy to hear that you had a good first impression. At Netflix, we do some Linux scheduler instrumentation with eBPF, and overhead matters. I was inspired to create the tool to enable the traditional performance work loop: get a baseline, tweak code, get another reading, rinse & repeat.
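
For the curious, here's roughly the loop bpftop automates, sketched with bpftool's JSON output (assumes root and a reasonably recent bpftool; treat the details as a sketch, not the tool's actual implementation):

  import json, subprocess, time

  def snapshot():
      out = subprocess.check_output(["bpftool", "-j", "prog", "show"])
      return {p["id"]: p for p in json.loads(out)}

  # Kernel runtime counters are off by default; enable them briefly.
  subprocess.check_call(["sysctl", "-qw", "kernel.bpf_stats_enabled=1"])
  try:
      before = snapshot()
      time.sleep(5)
      after = snapshot()
  finally:
      subprocess.check_call(["sysctl", "-qw", "kernel.bpf_stats_enabled=0"])

  # Per-program average runtime over the sampling window.
  for pid, prog in after.items():
      prev = before.get(pid, {})
      runs = prog.get("run_cnt", 0) - prev.get("run_cnt", 0)
      ns   = prog.get("run_time_ns", 0) - prev.get("run_time_ns", 0)
      if runs > 0:
          print(f"prog {pid} ({prog.get('name', '?')}): {ns / runs:.0f} ns/run")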


bpftop author here. Would you mind creating an issue to track this? https://github.com/Netflix/bpftop/issues


Done (https://github.com/Netflix/bpftop/issues/17). Seems to be some futex issue, the kind of bug that tends to be hard to replicate.


Who watches the Watchmen?


Titus is a federation layer that sits on top of many Kubernetes clusters.


Stock market and cryptocurrency bot for Slack and Discord: https://beeper.fyi/


How does this make money? I don't see any pricing info on the site.


How are you earning revenue on this? I don't see any pricing or signup requirements.


CEO of Cambridge Analytica caught on camera saying they have used bribes and sex workers to entrap politicians.

https://www.channel4.com/news/cambridge-analytica-revealed-t...


Wow, that's the most damning report I've ever seen on how politics is manipulated in this modern age.


Apropos of nothing, Ellen Pao commented on how Reddit handles user data:

> In 2014, we decided not to sell reddit user data because there is no way to monitor and verify use. There is no way to ensure data is deleted/corrected. It may be on any engineer's laptop. Who had access to copy it? How do you correct or delete PII or unauth porn after it's sold?

https://twitter.com/ekp/status/975192404167831552


Note that this is essentially the same as Facebook's policy: they did not sell user data as part of their advertising business. It was an app that siphoned up user data without Facebook being paid.

A lot of Reddit data is actually far more accessible via the API or the public data dump, which is on BigQuery. What makes the data less useful is the comparative anonymity (Reddit users tend to use pseudonyms) and the fact that you can't target users after building a targeting model from scraped data.

You could still trawl the Reddit data to find useful correlations, and maybe get a good deal on ads if you notice the overlap of /r/the_donald with /r/bedwetting or whatever.
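
That overlap check is one small query against the public comment dump; a sketch (the fh-bigquery table name is an assumption; snapshots move around, so use whichever export you have access to):

  from google.cloud import bigquery

  client = bigquery.Client()
  sql = """
      SELECT COUNT(DISTINCT a.author) AS shared_authors
      FROM `fh-bigquery.reddit_comments.2018_02` a   -- assumed snapshot
      JOIN `fh-bigquery.reddit_comments.2018_02` b
        ON a.author = b.author
      WHERE LOWER(a.subreddit) = 'the_donald'
        AND LOWER(b.subreddit) = 'bedwetting'
        AND a.author NOT IN ('[deleted]', 'AutoModerator')
  """
  row = next(iter(client.query(sql).result()))
  print(row.shared_authors)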


As someone who's done exactly that with the Reddit data (http://minimaxir.com/2016/06/reddit-related-subreddits/), posts/comments aren't as useful for intent as you note. The valuable Reddit data is the nonpublic views/subscription/upvote/downvote behavior.


No, that's massively different. Engaging in passive observation and reselling that data, while sleazy, is not the same as actively setting someone up.


Source on spez?


Very different. CA offered an unethical and possibly illegal service for sale.

Show me where Spez did that and I'll agree with you.


Blackmailing politicians is definitively on the illegal side.


I'd love to read up on spez blackmailing politicians, if you could please provide a link.


The smart thing to do here is build a Twitter background check app and sell a subscription to the NYTimes.


This is a pretty good idea. Something that scans tweets and other publicly accessible social media for a list of known words and then compiles the hits into a report for potential employers would go gangbusters in the HR/hiring space. It obviously couldn't catch everything, but it would be a big help for employers.
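
Mechanically it's not much more than this (a toy sketch; the word list and posts are made up, and a naive keyword match obviously has limits):

  import re
  from collections import defaultdict

  flagged_words = {"badword", "slur"}           # placeholder watch list
  posts = [                                     # placeholder scraped posts
      ("2013-04-01", "a perfectly harmless tweet"),
      ("2014-09-12", "an old tweet containing badword"),
  ]

  # Collect the dates on which each flagged word appears.
  report = defaultdict(list)
  for date, text in posts:
      for word in flagged_words:
          if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
              report[word].append(date)

  for word, dates in sorted(report.items()):
      print(f"{word}: {len(dates)} hit(s), first on {dates[0]}")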


Sounds wonderful. /s An automated tool that disqualifies people from employment for something ill-advised they once wrote in college based on a naive algorithm.

ADDED: Of course, if you want to be an authentically disruptive Silicon Valley startup, you'd offer both this service and a service for end users to expunge any online content that would be flagged by the employer-facing service. Should be good for some fawning tech press writeups.


I'm sure people who don't get a job because of something they said in the past wouldn't like it. I don't really like the idea of not hiring someone based on behavior far in the past (recent public behavior on social media is fair game, in my mind).

However, in terms of a startup product that companies would pay for, I stand by my claim that it's a great idea. After all, HN is run by Y Combinator, and startups are kind of its thing.


Smurfs 2, really?


Sponsored by Comcast. It's one way of reducing the bandwidth demands ;)


Google is scared of Pinterest.


What does Pinterest have to do with live chat?


Even less than either of them has to do with Meebo's current business model of being an advertising network through their toolbar.


This is the first I've heard about Pinterest. What do they have to do with Meebo?

