
The article talks about all of this and references the DeepSeek R1 paper [0], section 4.2 (first bullet point, on PRM), on why this is much trickier to do than it appears.

[0]: https://arxiv.org/abs/2501.12948


You could think of supervised learning as learning against a known ground truth, which pretraining certainly is.


A large number of breakthroughs in AI are based on turning unsupervised learning into supervised learning (AlphaZero-style MCTS as a policy improver is also like this). So the confusion is kind of intrinsic.


It's interesting to also compare this to getting a bare metal instance and provisioning microVMs on it using Firecracker. (Obviously something you shouldn't roll yourself in most cases.)

You can get a bare metal AX162 from Hetzner for 200 EUR/mo, with 48 cores and 256GB of RAM. At 4:1 virtual:physical CPU oversubscription, you could run 192 single-vCPU guests on such a machine, yielding a cost of 200/192 = 1.04 EUR/mo and giving each guest a bit over 1GiB of RAM. Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!
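
For reference, here's that back-of-envelope math as a quick Python sketch (the server figures are just the ones quoted above, and the oversubscription ratio and vCPUs per guest are illustrative knobs, not anything Hetzner-specific):

    # Rough per-guest cost when packing Firecracker microVMs onto one bare-metal box.
    # All figures are the illustrative ones from the comment above, not vendor quotes.
    MONTHLY_COST_EUR = 200
    PHYSICAL_CORES = 48
    RAM_GIB = 256
    CPU_OVERSUBSCRIPTION = 4   # virtual:physical core ratio
    VCPUS_PER_GUEST = 1

    guests = PHYSICAL_CORES * CPU_OVERSUBSCRIPTION // VCPUS_PER_GUEST
    print(f"guests per host:  {guests}")
    print(f"EUR per guest/mo: {MONTHLY_COST_EUR / guests:.2f}")
    print(f"RAM per guest:    {RAM_GIB / guests:.2f} GiB")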


"Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!" .... yea.. cause this is what these companies are doing behind the scenes :)


You didn't include the amortized cost of a Blackwell GPU, which is an order of magnitude larger expense than electricity.


Yeah that's fair (although the original comment was only talking about energy costs).

But this is kind of a worst-case cost analysis. I fully expect that the average non-pro Sora 2 video has one to two orders of magnitude less GPU utilization than I listed here (because I think those video tokens are probably generated at a batch size of ~100).


Warning: LLM-generated article, terribly difficult to follow and full of irrelevant details.


Well this was a trip down memory lane. I built a small game on Irrlicht at the time, and I remember these discussions as well.

Irrlicht had its editor (irrEdit), a sound system (irrKlang), and basic collision detection and an FPS controller built right into the engine. This was enough to get you a considerable way through a fully featured tech demo, at the very least. (I even remember Irrlicht including a beautiful first-person tech demo of traversing a large BSP-partitioned castle level.)

However, for those not afraid to stitch together these missing parts from other promising libraries (or derive them from first principles, as was fashionable), OGRE offered more raw rendering prowess: a working deferred shading system (this was the heyday of deferred shading), a pop-less terrain implementation with texture splatting, and more impressive shader and rendering pipeline support via the Cg multi-platform shading language. I remember some fairly impressive ocean-surface and Fresnel refraction/reflection demos from OGRE at the time.


What an astounding achievement. In 6 years, this person has written not only a very well-designed microkernel, but also a build system, a UEFI bootloader, a graphical shell, a UI framework, and a browser engine.

The story of 10x developers among us is not a myth... if anything, it's understated.


And unlike a similar project, they accomplished it without the benefit of divine guidance.

Very impressive!


The greatest programmer who ever lived. Gifted with divine intellect.


[flagged]


Not with Messiah.ai :D


Oh my God. That domain is parked and for sale for $125,000?!?!

Wild.


Oh, that is nothing. Check out god.ai... domain parking is back. At this point we might as well just have a TLD for .god


> TLD for .god

Sounds like a good TLD for an "identity and access management" system :)


Musk would just hog it for himself


You might enjoy reading the SerenityOS progress reports

https://serenityos.org/


I want serenity now



Yeah it’s amazing.


Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?


Oh, nothing official. There are people who estimate the sizes based on tok/s, cost, benchmarks, etc. The one most people go on is https://lifearchitect.substack.com/p/the-memo-special-editio.... This guy estimated Claude 3 Opus to be a 2T-param model (given the pricing + speed). Opus 4 is 1.2T params according to him (but then I don't understand why the price remained the same). Sonnet is estimated by various people to be around 100B-200B params.

[1]: https://docs.google.com/spreadsheets/d/1kc262HZSMAWI6FVsh0zJ...


If you're using the API cost of the model to estimate its size, then you can't use this size estimate to estimate the inference cost.


tok/s cannot in any way be used to estimate parameters. It's a tradeoff made at inference time. You can adjust your batch size to serve 1 user at a huge tok/s or many users at a slow tok/s.
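
Here's a toy model of that tradeoff (both constants are invented purely for illustration; real serving stacks depend on memory bandwidth, KV-cache size, and scheduling in ways this ignores):

    # Toy decode model: the same weights can be served at very different
    # per-user speeds just by changing the batch size, so observed tok/s
    # tells you almost nothing about parameter count.
    # Both constants below are made-up illustrative numbers.
    STEP_FLOOR_MS = 20.0   # per-step cost of streaming the weights (memory-bound floor)
    PER_SEQ_MS = 0.5       # incremental per-step cost of each extra sequence in the batch

    def per_user_tok_s(batch_size: int) -> float:
        step_ms = STEP_FLOOR_MS + PER_SEQ_MS * batch_size
        return 1000.0 / step_ms   # each user gets one token per decode step

    for bs in (1, 8, 64, 256):
        per_user = per_user_tok_s(bs)
        print(f"batch={bs:>3}: {per_user:5.1f} tok/s per user, {per_user * bs:7.1f} tok/s aggregate")

Per-user speed falls as the batch grows while aggregate throughput rises, and that's exactly the knob a provider turns depending on load.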


Not everyone uses MoE architectures. It's not outlandish at all...


There's no way Sonnet 4 or Opus 4 are dense models.


Citation needed


Common sense:

- The compute requirements would be massive compared to the rest of the industry

- Not a single large open source lab has trained anything over 32B dense in the recent past

- There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided benefit compared to MoE, the info would've spread like wildfire.


Seems heavily vibe coded, down to the Claude-generated README and a lot of the LLM prompts themselves (which I have found work very poorly compared to human-written prompts). While none of this is necessarily bad, it requires a higher burden of proof that it actually works beyond toy problems [0]. I think everyone would appreciate some examples of vulnerabilities it can find. The missing JWT check showcased in the screenshot would've probably been caught with ordinary AI code review, so to my eye that by itself is not persuasive.

Good luck!

[0]: Why I say this --- a 10kLOC piece of software that was mostly human-written would require a large amount of testing, even manual, to ensure that it works, reliably, at all. All this testing and experimentation would naturally force a certain depth of exploration for the approach, the LLM prompts, etc across a variety of usecases. A mostly AI-written codebase of this size would've required much less testing to get it to "doesn't crash and runs reliably", and so this depth is not a given anymore.


Thanks for sharing this! It's difficult to find good examples of useful codebases where coding agents have done most of the work. I'm always actively looking at how I can push these agents to do more for me and it's very instructive to hear from somebody who has had success on this level. (Would be nice to read a writeup, too)


It's coming soon! This experiment has really taught me a lot about the limits of agentic code assistants: the stuff they're good at, they're insanely good at, and the stuff they're horrible at, they cannot seem to overcome. I did write a little bit about how I use Claude Code [1] before I started this project a while back, and I'm planning to finish a sequel pretty soon.

[1]: https://diwank.space/field-notes-from-shipping-real-code-wit...

