The article talks about all of this and references the DeepSeek R1 paper [0], section 4.2 (the first bullet point, on PRMs), on why this is much trickier to do than it appears.
A large number of breakthroughs in AI come from turning unsupervised learning into supervised learning (AlphaZero-style MCTS as a policy improver is like this too), so the confusion is kind of intrinsic.
It's interesting to also compare this to getting a bare metal instance and provisioning microVMs on it using Firecracker. (Obviously something you shouldn't roll yourself in most cases.)
You can get a bare metal AX162 from Hetzner for 200 EUR/mo, with 48 cores and 256 GB of RAM. At 4:1 virtual:physical CPU oversubscription you could run 192 single-vCPU guests on such a machine, yielding a cost of 200/192 ≈ 1.04 EUR/mo and giving each guest a bit over 1 GiB of RAM. Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!
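If you want to poke at the numbers, here's the back-of-the-envelope math as a tiny sketch; one vCPU per guest and evenly split RAM are my assumptions, not anything Hetzner specifies:

```python
# Back-of-the-envelope cost per microVM guest on one bare-metal box.
# Assumptions: specs as quoted above, one vCPU per guest, RAM split evenly.
monthly_cost_eur = 200
physical_cores = 48
ram_gib = 256
cpu_oversub = 4                               # 4:1 virtual:physical

guests = physical_cores * cpu_oversub         # 192 single-vCPU guests
cost_per_guest = monthly_cost_eur / guests    # ~1.04 EUR/mo
ram_per_guest = ram_gib / guests              # ~1.33 GiB

print(f"{guests} guests at {cost_per_guest:.2f} EUR/mo, {ram_per_guest:.2f} GiB each")
```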
"Interestingly, that's not groundbreakingly cheaper than just getting one of Hetzner's virtual machines!" .... yea.. cause this is what these companies are doing behind the scenes :)
Yeah that's fair (although the original comment was only talking about energy costs).
But this is kind of a worst-case cost analysis. I fully expect the average non-Pro Sora 2 video to use one to two orders of magnitude less GPU time than I listed here, because those video tokens are probably generated at a batch size of roughly 100.
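To make the batching point concrete, here's a toy sketch; every number in it (GPU-hour price, wall-clock time, GPU count) is invented purely for illustration, and it ignores that a larger batch runs somewhat slower per step:

```python
# Toy illustration: batching spreads the same GPU time across many videos.
# All numbers below are made up, not measurements of Sora 2.
gpu_hour_cost_usd = 3.0      # assumed price of one GPU-hour
wall_clock_min    = 5        # assumed time the GPUs spend on one generation pass
gpus_in_use       = 8        # assumed GPUs working on that pass

for batch_size in (1, 100):
    gpu_hours = gpus_in_use * wall_clock_min / 60
    cost_per_video = gpu_hours * gpu_hour_cost_usd / batch_size
    print(f"batch {batch_size:4d}: ~{cost_per_video:.3f} USD of GPU time per video")
```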
Well, this was a trip down memory lane. I built a small game on Irrlicht at the time and I remember these discussions too.
Irrlicht had its own editor (irrEdit) and sound system (irrKlang), and basic collision detection and an FPS controller were built right into the engine. That was enough to get you a considerable way toward a fully featured tech demo, at the very least. (I even remember Irrlicht shipping a beautiful first-person tech demo of traversing a large BSP-partitioned castle level.)
However, for those not afraid to stitch in these additional parts from other promising libraries (or derive them from first principles, as was fashionable), OGRE offered more raw rendering prowess: a working deferred shading system (this was the heyday of deferred shading), a pop-less terrain implementation with texture splatting, and more impressive shader and rendering-pipeline support via the Cg multi-platform shading language. I remember fairly impressive ocean-surface and Fresnel refraction/reflection demos from OGRE at the time.
What an astounding achievement. In 6 years, this person has written not only a very well-designed microkernel, but a build system, UEFI bootloader, graphical shell, UI framework, and a browser engine.
The story of 10x developers among us is not a myth... if anything, it's understated.
Are you saying that you think Sonnet 4 has 100B-200B _active_ params? And that Opus has 2T active? What data are you basing these outlandish assumptions on?
Oh, nothing official. There are people who estimate the sizes based on tok/s, cost, benchmarks, etc. The one most people go by is https://lifearchitect.substack.com/p/the-memo-special-editio.... He estimated Claude 3 Opus to be a 2T-param model (given the pricing + speed), and puts Opus 4 at 1.2T params (though then I don't understand why the price remained the same). Sonnet is estimated by various people to be around 100B-200B params.
tok/s cannot by itself be used to estimate parameter count; it's a tradeoff made at inference time. You can adjust your batch size to serve one user at a huge tok/s or many users at a slow tok/s.
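Here's a toy roofline sketch of why; the hardware and model numbers are rough assumptions, purely to show the shape of the tradeoff:

```python
# Toy roofline for transformer decoding: each step is the slower of
# (a) streaming the active weights from HBM once, shared by the whole batch, and
# (b) doing the matmul FLOPs for everyone in the batch.
# Hardware/model numbers below are ballpark assumptions, not measurements.
hbm_bandwidth = 3.35e12      # bytes/s (roughly one H100 SXM)
peak_flops    = 1.0e15       # FLOP/s (rough 8-bit ballpark)
active_params = 100e9        # assumed active parameter count
bytes_per_w   = 1            # 8-bit weights

weight_time   = active_params * bytes_per_w / hbm_bandwidth
flops_per_tok = 2 * active_params

for batch in (1, 32, 256, 2048):
    step_time = max(weight_time, batch * flops_per_tok / peak_flops)
    print(f"batch {batch:5d}: ~{1/step_time:5.0f} tok/s per user, "
          f"~{batch/step_time:7.0f} tok/s aggregate")
```

The same observed per-user tok/s is consistent with many different (model size, batch size) combinations, which is why speed alone doesn't identify the parameter count.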
- The compute requirements would be massive compared to the rest of the industry
- Not a single large open source lab has trained anything over 32B dense in the recent past
- There is considerable crosstalk between researchers at large labs; notice how all of them seem to be going in similar directions all the time. If dense models of this size actually provided benefit compared to MoE, the info would've spread like wildfire.
Seems heavily vibe-coded, down to the Claude-generated README and a lot of the LLM prompts themselves (which, in my experience, work very poorly compared to human-written prompts). While none of this is necessarily bad, it raises the burden of proof that the tool actually works beyond toy problems [0]. I think everyone would appreciate some examples of vulnerabilities it can find. The missing JWT check showcased in the screenshot would probably have been caught by ordinary AI code review, so to my eye that by itself is not persuasive.
Good luck!
[0]: Why I say this: a 10kLOC piece of software that was mostly human-written would require a large amount of testing, even manual testing, to work reliably at all. All that testing and experimentation naturally forces a certain depth of exploration of the approach, the LLM prompts, etc. across a variety of use cases. A mostly AI-written codebase of this size would have required much less testing to reach "doesn't crash and runs reliably", so that depth is no longer a given.
Thanks for sharing this! It's difficult to find good examples of useful codebases where coding agents have done most of the work. I'm always actively looking at how I can push these agents to do more for me and it's very instructive to hear from somebody who has had success on this level. (Would be nice to read a writeup, too)
It's coming soon! This experiment has really taught me a lot about the limits of agentic code assistants: the stuff they're good at, they're insanely good at, and the stuff they're horrible at, they can't seem to overcome. I did write a little bit about how I use Claude Code [1] before I started this project a while back, and I'm planning to finish a sequel pretty soon.
[0]: https://arxiv.org/abs/2501.12948