If we've learned anything so far, it's that the parlor tricks of one-shot efficacy only get you so far. Drill into anything relatively complex with a few hundred thousand tokens of context and the models all start to fall apart in roughly the same way. Even using Sonnet 4.5 with the 1M-token context, the model starts to flake out and get confused on a codebase of less than 10k LoC. Everyone keeps claiming these huge leaps and bounds, but I really have to wonder how many of them are just shilling for their corporate overlord. I asked Gemini 3 to solve a simple but not-well-documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
Same. I've needed to update a userscript (JS) that takes things like "3 for the price of 1", "5 + 1 free", "35% discount!" from a particular site and converts each offer into a % discount and a price per item / per 250 grams.
It's an old userscript, so it's glitchy and only halfway works. I had already pre-chewed the work by telling Gemini 3 exactly which new HTML elements it needs to match and which contents it needs to parse. So basically, the scaffolding is already there, the sources are already there; it just needs to put everything in place.
It fails miserably and produces very convincing-looking but broken code. Even letting it iterate multiple times does nothing, nor does nudging it in the correct direction. Mind you, JavaScript is probably the most trained-on language together with Python, and parsing HTML is one of the most common use cases.
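To give an idea, the whole conversion boils down to logic roughly like this (a sketch only; the real site's promo wording and markup are more varied, and these regexes are made up for illustration):

    // Rough sketch of the conversion logic (illustrative regexes only).
    function promoToDiscountPct(text) {
      let m;
      // "3 for the price of 1" -> pay for 1, get 3 -> ~66.7% off
      if ((m = /(\d+)\s*for the price of\s*(\d+)/i.exec(text)))
        return (1 - Number(m[2]) / Number(m[1])) * 100;
      // "5 + 1 free" -> pay for 5, get 6 -> ~16.7% off
      if ((m = /(\d+)\s*\+\s*(\d+)\s*free/i.exec(text)))
        return (Number(m[2]) / (Number(m[1]) + Number(m[2]))) * 100;
      // "35% discount!" -> taken at face value
      if ((m = /(\d+(?:[.,]\d+)?)\s*%/.exec(text)))
        return Number(m[1].replace(',', '.'));
      return null; // unrecognized promo wording
    }

    // Normalized price per 250 g from a pack price and a weight in grams.
    function pricePer250g(packPrice, grams) {
      return (packPrice / grams) * 250;
    }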
Another hilarious example is mpv, which has very well-documented settings. I used to think LLMs would mean you could just tell people to ask Gemini how to configure it, but 9 times out of 10 it will hallucinate a bunch of parameters that never existed.
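For contrast, a correct answer only needs a handful of options that genuinely do exist in mpv's manual, something like this mpv.conf sketch (whether these values suit any given setup is a separate question):

    # mpv.conf -- all of these are real, documented options
    vo=gpu                        # standard GPU-accelerated video output
    hwdec=auto                    # use hardware decoding when available
    video-sync=display-resample   # sync video to the display's refresh rate
    interpolation=yes             # motion smoothing (needs display-resample)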
It gives me an extremely weird feeling when other people are cheering that it is solving problems at superhuman speeds or that it coded a way to ingest their custom XML format in record time, with relatively little prompting. It seems almost impossible that LLMs can both be so bad and so good at the same time, so what gives?
1. Coding with LLMs seems to be all about context management: getting the LLM to deal with the minimum amount of code needed to fix the problem or build the feature, carefully managing token limits, artificially resetting the session when needed so the context handover is controlled, all that. Just pointing an LLM at a large code base and expecting good things doesn't work.
2. I've found the same with Gemini; I can rarely get it to actually do useful things. I have tried many times, but it just underperforms compared to the other mainstream LLMs. Other people have different experiences, though, so I suspect I'm holding it wrong.
The problem is that by that point it's much less useful in projects. I still like LLMs, but when I get to the point of telling one exactly what to do, I'm mostly just being lazy. It's useful in that it might give me some ideas I didn't consider, but I'm not sure it's saving time.
Of course, for short one-off scripts, it's amazing. It's also really good at preliminary code reviews. Although if you have some awkward bits due to things outside of your control, it'll always complain about them, insist they're wrong, and claim everything could be so much easier if you'd just do it the naive way.
Amazon's Kiro IDE seems to have a really good flow, trying to split large projects into bite-sized chunks. I, sadly, couldn't even get it to implement solitaire correctly, but the idea sounds good. Agents also seem to help a lot, since they can just work by trial and error, but company policy understandably gets complicated quickly if you want to hand the entire repo to an LLM agent and run the 'user-approved' commands it suggests.
In my experience with vibe coding, you spend a lot of time preparing documentation and baseline context for the LLM.
On one of my projects, I downloaded a library's source code locally and asked Claude to write up a markdown file documenting how to use it, with examples, etc.
Like, taking your solitaire example, I'd ask an LLM to write the rules into a markdown file and tell the coding one to refer to those rules.
I understand it to be a bit like mise en place for cooking.
You tell it what you want and it gives you a list of requirements, which in this case would mostly be the rules of Solitaire.
You adjust those until you're happy, then you let it generate tasks, which are essentially epics with smaller tickets in order of dependency.
You approve those and then it starts developing task by task where you can intervene at any time if it starts going off track.
The requirements and tasks it does really well; it's the connection of the epics/larger tasks where it mostly crumbles. I could have made it work with some more messing around, but I've noticed over a couple of projects that, at least in my tries, it always crumbles either at the connection of the epics/large tasks, or when you ask it to do a small modification later down the line and it causes a lot of smaller, subtle changes all over the place. (You could call it a skill issue, since I overlooked something in the requirements, but that's kind of how real projects go, so..)
It also eats tokens like crazy for private usage, but that's more of a 'playing around' problem. As it stands, I'd probably blow $100 a day if I connected it to an actual commercial repo and started experimenting. Still viable with my salary, but still..
> ...not-well-documented problem in Home Assistant this evening. All it would take is 3-5 lines of YAML. The model failed miserably. I think we're all still safe.
This is mostly because HA changes so frequently and the documentation is sparse. To get around this and increase the rate of correct answers, I give it access to the source code of the same version I'm running, with instructions in CLAUDE.md on where to find the source and that it must use it.
For this issue (additional Media Player storage locations), the configuration is actually quite old.
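For what it's worth, the setting in question is presumably media_dirs in configuration.yaml, which is the documented way to add extra media locations (the paths below are just examples):

    # configuration.yaml -- extra media storage locations (example paths)
    homeassistant:
      media_dirs:
        media: /media
        recordings: /mnt/nas/recordings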
It does showcase that LLMs don't truly "think" when they're not even able to search for and find the things mentioned. But even then, this configuration has been stable for years, and the training data should have plenty of mentions of it.
It's not really magic: in my project folder I will git clone the source code of whatever I'm working on, then put something like this in the local md file:
    Use ./home-assistant/core for the source code of Home Assistant, it's the
    same version that I'm running. Always search and reference the source when
    debugging a problem.
I also have it frequently do deep dives into source code on a particular problem and write a detailed md file so it only needs to do that once.
"Deep dive into this code, find everything you can find about automations and then write a detailed analysis doc with working examples and source code, use the source code."
It depends on your definition of safe. Most of the code that gets written is pretty simple: basic CRUD web apps, WP theme customization, simple mobile games… stuff that can easily be written by the current gen of tooling. That has already cost a lot of people a lot of money, or their jobs outright, and most of them probably haven't reached their skill limit as developers.
As the available work increases in complexity, I reckon more people will push themselves to take jobs further out of their comfort zone. Previously, the choice was to upskill for the challenge and greater earnings, or stay where you are, which is easy and reliable; the current choice is upskill or find a new career. Rather than switch careers to something they have zero experience in, most will upskill. That puts pressure on the moderately higher-skill job market, which has far fewer people, and they start to upskill to outrun the implosion, which puts pressure on the level above them to move upward, and so on. With even modest productivity gains across the whole industry, it's not hard for me to envision a world where general software development just isn't a particularly valuable skill anymore.
Everything in tech is cyclical, and AI will be no different. Everyone outsourced, realized the pain and suffering, and corrected. AI isn't immune to the same trajectory or the same mistakes. And as corporations realize that nobody has a clue how their apps or infra actually run, they're one breach away from putting a relatively large organization under.
The final kicker in this simple story is that there are many, many narcissistic folks in the C-suite. Do you really think Sam Altman and Co. are going to take the blame for Billy's shitty vibe-coded breach? Yeah, right. Welcome to the real world of the enterprise, where you still need an actual throat to choke to show your leadership skills.
I absolutely don't think vibe coding or barely supervised agents will replace coders the way outsourcing claimed to (and in some cases did, and still does), and outsourcing absolutely affected the job market. If the whole thing does improve, and doesn't turn out to be too wildly unprofitable to survive, what it will do is allow good-quality coders, people who understand what can and can't go out without being heavily scrutinized, to do a lot more work. That is a totally different force than outsourcing, which to some extent assumed software developers were all basically fungible code monkeys.
There's a lot to unpack here. I agree, outsourcing did affect the job market; you're just seeing the negative (US) side. If anything, outsourcing was hugely beneficial to the Indian market, where most of those contracts landed. My point was that it was sold as a solution that didn't net the value proposition it claimed, and that is why I said AI is not immune to being cyclical, just like outsourcing. AI is being sold as worker replacement. It's not even close, and if it were, then OpenAI, Anthropic, and Google would have all replaced a lot of people already and wouldn't be letting you and me use their tools for $20/month. When it does get that good, we will no longer be able to afford these "enterprise" tools.
With respect to profitability: there's none in sight. When JP Morgan [0] is saying that $650B in annual revenue is needed to make a paltry 10% return on investment, there is no way any sane financial institution would pump more money into that sunk cost. Yet here we are, building billions of dollars in datacenters for what... mediocre chat bots?

Again, these things don't think. They don't reason. They're massive word graphs being used in clever ways, with cute, humanizing descriptions. Are they useful for helping a human parse way more information than we can reason about at once? For sure! But that's not worth trillions in investment and won't yield multiples of the input.

In fact, I'd argue the AI landscape would be much better off if the dollars stopped flowing, because that would force real research to be done in a much more efficient and effective manner. Instead, we're paying individual people hundreds of millions of dollars who, and good for them, have no clue or care about what actually happens with AI, because: money in the bank. No, AI in its current form is not profitable, and it's not going to be if we continue down this path. We've literally spent world-changing sums of money on models that are used to create art that will displace the original creators well before they solve any useful world problems.
Finally, to your last point about "good quality coders": how long do you think that will be a thing, given how this is all unfolding? Am I writing better code (I'm not a programmer by day) with LLMs? Yes and no. Yes when I need to build a visually appealing UI for something, and yes when it comes to working within a framework. But what I've found is that if I don't put all of the right pieces in the right places before I start, I end up with an untenable mess within the first couple thousand lines of code. So if people stop becoming "good quality programmers", then what? These models only get better with better training data, and the web will continue to turn insular against these IP-stealing efforts. The data isn't free; it never has been. And this is why we're now hearing the trope of "world models": a way to ask for trillions more to provide millionths of a penny on the invested dollar.