
Marketing is being done really well in 2025, with brands injecting themselves into conversations on Reddit, LinkedIn, and every other public forum. [1]

CEOs, AI "thought leaders," and VCs are advertising LLMs as magic, and tools like v0 and Lovable as the next big thing. Every response from leaders is some variation of https://www.youtube.com/watch?v=w61d-NBqafM

On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing. It’s up to the LLM to follow instructions, and it does so based on RNG as far as I can tell. I have very simple, basic rules set up that are never followed. This leads me to believe everyone posting on that thread on Cursor is an amateur.

Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

I’m at a stage where I use LLMs the same way I would use speech-to-text (code) - telling the LLM exactly what I want, what files it should consider, and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.

Edit:

[1] To add to this, any time you use search or Perplexity or what have you, the results come from all this marketing garbage being pumped into the internet by marketing teams.



> if you’re working on novel code, LLMs are absolutely horrible

This is spot on. Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs). But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.


> at least for now.

I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or its limit of capability.

My thinking here is that we already had the technologies of the LLMs and the compute, but we hadn't yet had the reason and capital to deploy it at this scale.

So the surprising innovation of transformers did not give us the boost in capability by itself; it still needed scale. The marketing that enabled the capital, which enabled that scale, was what caused the insane growth, and capital can't grow forever; it needs returns.

Scale has been exponential, and we are hitting an insane amount of capital deployment for this one technology, which has yet to prove commercially viable at the scale of a paradigm shift.

Are businesses that are not AI-based actually seeing ROI on AI spend? That is really the only question that matters, because if that is false, the money and drive for the technology vanish and the scale that enables it disappears too.


> Yes it advanced extremely quickly, but that is not a confirmation of anything. It could just be the technology quickly meeting us at either our limit of compute, or its limit of capability.

To comment on this, because it's the most common counterargument: most technology has worked in steps. We take a step forward, then iterate on essentially the same thing. It's very rare that we see an order-of-magnitude improvement on the same fundamental "step".

Cars were quite a step forward from donkeys, but modern cars are not that far off from the first ones. Planes were an amazing invention, but the next model of plane is basically the same thing as the first one.


I agree, I think we are in the latter phase already. LLMs were a huge leap in machine learning, but everything after has been steps on top + scale.

I think we would need another leap to actually meet the market's expectations on AI. The market is expecting AGI, but I think we are probably just going to do incremental improvements for language and multimodal models from here, and not meet those expectations.

I think the market is relying on something that doesn't currently exist to become true, and that is a bit irrational.


Transformers aren't it, though. We need a new fundamental architecture and, just like every step forward in AI that came before, when that happens is a completely random event. Some researcher needs to wake up with a brilliant idea.

The explosion of compute and investment could mean that we have more researchers available for that event to happen, but at the same time transformers are sucking up all the air in the room.


Several people hinted at the limits this technology was about to face, including training data and compute. It was obvious it had serious limits.

Despite the warnings, companies insisted on marketing superintelligence nonsense and magic automatic developers. They convinced the market with disingenuous demonstrations, which, again, were called out as bullshit by many people. They are still doing it. It's the same thing.


> Yes it advanced extremely quickly

The things that impress me about gpt-5 are basically the same ones that impressed me about gpt-3. For all the talk about exponential growth, I feel like we experienced one big technical leap forward and have spent the past 5 years fine-tuning the result—as if fiddling with it long enough will turn it into something it is not.


When building their LLMs, the model makers consumed the entire internet. This allowed the models to improve exponentially fast. But there's no more internet to consume. Yes, new data is being generated, but not at anywhere near the rate the models were growing in capability just a year ago. That's why we're seeing diminishing returns when comparing, say, GPT-5 to GPT-4.

The AI marketers, accelerationists and doomers may seem to be different from one another, but the one thing they have in common is their adherence to an extrapolationist fallacy. They've been treating the explosion of LLM capabilities as a promise of future growth and capability, when in fact it's all an illusion. Nothing achieves indefinite exponential growth. Everything hits a wall.


> Yes it advanced extremely quickly,

It did, but it's kinda stagnated now, especially on the LLM front. The time when a groundbreaking model came out every week is over for now. Later revisions of existing models, like GPT5 and Llama 4, have been underwhelming.


GPT5 may have been underwhelming to _you_. Understand that they're heavily RLing to raise the floor on these models, so they might not be magically smarter across the board, but there are a LOT of areas where they're a lot better that you've probably missed because they're not your use case.


every time i say "the tech seems to be stagnating" or "this model seems worse" based on my observations i get this response. "well, it's better for other use cases." i have even heard people say "this is worse for the things i use it for, but i know it's better for things i don't use it for."

i have yet to hear anyone seriously explain to me a single real-world thing that GPT5 is better at with any sort of evidence (or even anecdote!) i've seen benchmarks! but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.

the few cases i have heard that venture near that ask may be moderately intriguing, but don't seem to justify the overall cost of building and running the model, even if there have been marginal or perhaps even impressive leaps in very narrow use cases. one of the core features of LLMs is they are allegedly general-purpose. i don't know that i really believe a company is worth billions if they take their flagship product that can write sentences, generate a plan, follow instructions and do math and they are constantly making it moderately better at writing sentences, or following instructions, or coming up with a plan and it consequently forgets how to do math, or becomes belligerent, or sycophantic, or what have you.

to me, as a user with a broad range of use cases (internet search, text manipulation, deep research, writing code) i haven't seen many meaningful increases in quality of task execution in a very, very long time. this tracks with my understanding of transformer models, as they don't work in a way that suggests to me that they COULD be good at executing tasks. this is why i'm always so skeptical of people saying "the big breakthrough is coming." transformer models seem self-limiting by merit of how they are designed. there are features of thought they simply lack, and while i accept there's probably nobody who fully understands how they work, i also think at this point we can safely say there is no superintelligence in there to eke out and we're at the margins of their performance.

the entire pitch behind GPT and OpenAI in general is that these are broadly applicable, dare-i-say near-AGI models that can be used by every human as an assistant to solve all their problems and can be prompted with simple, natural language english. if they can only be good at a few things at a time and require extensive prompt engineering to bully into consistent behavior, we've just created a non-deterministic programming language, a thing precisely nobody wants.


The simple explanation for all this, along with the milquetoast replies kasey_junk gave you, is that to its acolytes, AI and LLMs cannot fail, only be failed.

If it doesn't seem to work very well, it's because you're obviously prompting it wrong.

If it doesn't boost your productivity, either you're the problem yourself, or, again, you're obviously using it wrong.

If progress in LLMs seems to be stagnating, you're obviously not part of the use cases where progress is booming.

When you have presupposed that LLMs and this particular AI boom is definitely the future, all comments to the contrary are by definition incorrect. If you treat it as a given that this AI boom will succeed (by some vague metric of "success") and conquer the world, skepticism is basically a moral failing and anti-progress.

The exciting part about this belief system is how little you actually have to point to hard numbers and, indeed, rely on faith. You can just entirely vibe it. It FEELS better and more powerful to you, your spins on the LLM slot machine FEEL smarter and more usable, it FEELS like you're getting more done. It doesn't matter if those things are actually true over the long run, it's about the feels. If someone isn't sharing your vibes about the LLM slot machine, that's entirely their fault and problem.


And on the other side, to detractors, AI and LLMs cannot ever succeed. There's always another goalpost to shift.

If it seems to work well, it's because it's copying training data. Or it sometimes gets something wrong, so it's unreliable.

If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

If they point to improvements in benchmarks, it's because model vendors are training to the tests, or the benchmarks don't really measure real-world performance.

If the improvements are in complex operations where there aren't benchmarks, their reports are too vague and anecdotal.

The exciting part about this belief system is how little you have to investigate the actual products, and indeed, you can simply rely on a small set of canned responses. You can just entirely dismiss reports of success and progress; that's completely due to the reporter's incompetence and self-delusion.


I work in a company that's "all in on AI" and there's so much BS being blown up just because they can't have it fail because all the top dogs will have mud on their faces. They're literally just faking it. Just making up numbers, using biased surveys, making sure employees know it's being "appreciated" if they choose option A "Yes AI makes me so much more productive" etc.

This is definitely something that biases me against AI, sure. Seeing how the sausage is made doesn't help. Because it's really a lot of offal right now especially where I work.

I'm a very anti-corporate non-teamplayer kinda person so I tend to be highly critical, I'll never just go along with PR if it's actually false. I won't support my 'team' if it's just wrong. Which often rubs people the wrong way at work. Like when I emphasised in a training that AI results must be double-checked. Or when I answered in an "anonymous" survey that I'd rather have a free lunch than "copilot" and rated it a 2 out of 5 in terms of added value (I mean, at the time it didn't even work in some apps).

But I'm kinda done with soul-killing corporatism anyway. Just waiting for some good redundancy packages when the AI bubble collapses :)


> If they say it boosts their productivity, they're obviously deluded as to where they're _really_ spending time, or what they were doing was trivial.

A pretty substantial number of developers are doing trivial edits to business applications all over the globe, pretty much continuously. At least in the low-to-mid double-digit percentages.


wouldn't call myself a detractor. i wouldn't call it a belief system i hold (i am an engineer 20 years into my career and would love to automate away the tedious parts of my job i've done a thousand times) so much as a position i hold based on the evidence i've seen in front of me.

i constantly hear that companies are running with "50% of their code written by AI!" but i've yet to meet an engineer who says they've personally seen this. i've met a few who say they see it through internal reporting, though it's not the case on their team. this is me personally! i'm not saying these people don't exist. i've heard it much more from senior leadership types i've met in the field - directors, vps, c-suite, so on.

i constantly hear that AI can do x, y, or z, but no matter how many people i talk to or how much i or my team works towards those goals, it doesn't really materialize. i can accept that i may be too stupid (though i'd argue that if that's the problem, the AI isn't as good as claimed) but i work with some brilliant people and if they can't see results, that means something to me.

i see people deploying the tool at my workplace, and recently had to deal with a situation where leadership was wondering why one of our top performers had slowed down substantially and gotten worse, only to find that the timeline exactly aligned with them switching to cursor as their IDE.

i read papers - lots of papers - and articles about both positive and negative assertions about LLMs and their applicability in the field. i don't feel like i've seen compelling evidence in research not done by the foundation model companies that supports the theory this is working well. i've seen lots of very valid and concerning discoveries reported by the foundation model companies, themselves!

there are many places in the world i am a hardliner on no generative AI and i'll be open about that - i don't want it in entertainment, certainly not in music, and god help me if i pick up the phone and call a company and an agent picks up.

for my job? i'm very open to it. i know the value i provide above what the technology could theoretically provide, i've written enough boilerplate and the same algorithms and approaches for years to prove to myself i can do it. if i can be as productive with less work, or more productive with the same work? bring it on. i am not worried about it taking my job. i would love it to fulfill its promise.

i will say, however, that it is starting to feel telling that when i lay out any sort of reasoned thought on the issue that (hopefully) exposes my assumptions, biases, and experiences, i largely get vague, vibes-based answers, unsourced statistics, and responses that heavily carry the implication that i'm unwilling to be convinced or being dogmatic. i very rarely get thoughtful responses, or actual engagement with the issues, concerns, or patterns i write about. oftentimes refutations of my concerns or issues with the tech are framed as an attack on my willingness to use or accept it, rather than a discussion of the technology on its merits.

while that isn't everything, i think it says something about the current state of discussion around the technology.


You really thought you had a post with this one huh. I have second-hand embarrassment for you.


Claude Sonnet 4.5 is _way_ better than previous sonnets and as good as Opus for the coding and research tasks I do daily.

I rarely use Google search anymore, both because llms got that ability embedded and because the chatbots are good at looking through the swill that search results have become.


"it's better at coding" is not useful information, sorry. i'd love to hear tangible ways it's actually better. does it still succumb to coding itself in circles, taking multiple dependencies to accomplish the same task, applying inconsistent, outdated, or non-idiomatic patterns for your codebase? has compliance with claude.md files and the like actually improved? what is the round trip time like on these improvements - do you have to have a long conversation to arrive at a simple result? does it still talk itself into loops where it keeps solving and unsolving the same problems? when you ask it to work through a complex refactor, does it still just randomly give up somewhere in the middle and decide there's nothing left to do? does it still sometimes attempt to run processes that aren't self-terminating to monitor their output and hang for upwards of ten minutes?

my experience with claude and its ilk is that they are insanely impressive in greenfield projects and collapse in legacy codebases quickly. they can be a force multiplier in the hands of someone who actually knows what they're doing, i think, but the evidence of that even is pretty shaky: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

the pitch that "if i describe the task perfectly in absolute detail it will accomplish it correctly 80% of the time" doesn't appeal to me as a particularly compelling justification for the level of investment we're seeing. actually writing the code is the simplest part of my job. if i've done all the thinking already, i can just write the code. there's very little need for me to then filter that through a computer with an overly-verbose description of what i want.

as for your search results issue: i don't entirely disagree that google is unusable, but having switched to kagi... again, i'm not sure the order of magnitude of complexity of searching via an LLM is justified? maybe i'm just old, but i like a list of documents presented without much editorializing. google has been a user-hostile product for a long time, and its particularly recent quality collapse has been well-documented, but this seems a lot more a story of "a tool we relied on has gotten measurably worse" and not a story of "this tool is meaningfully better at accomplishing the same task." i'll hand it to chatgpt/claude that they are about as effective as google was at directing me to the right thing circa a decade ago, when it was still a functional product - but that brings me back to the point that "man, this is a lot of investment and expense to arrive at the same result way more indirectly."


You asked for a single anecdote of llms getting better at daily tasks. I provided two. You dismissed them as not valuable _to you_.

It’s fine that your preferences aren’t aligned such that you don’t value the model or improvements that we’ve seen. It’s troubling that you use that to suggest there haven’t been improvements.


you didn't provide an anecdote. you just said "it's better." an anecdote would be "claude 4 failed in x way, and claude 4.5 succeeds consistently." "it is better" is a statement of fact with literally no support.

the entire thrust of my statement was "i only hear nonspecific, vague vibes that it's better with literally no information to support that concept" and you replied with two nonspecific, vague vibes. sorry i don't find that compelling.

"troubling" is a wild word to use in this scenario.


My one shot rate for unattended prompts (triggered GitHub actions) has gone from about 2 in 3 to about 4 in 5 with my upgrade to 4.5 in the codebase I program in the most (one built largely pre-ai). These are highly biased to tasks I expect ai to do well.

Since the upgrade I don’t use opus at all for planning and design tasks. Anecdotally, I get the same level of performance on those because I can choose the model and I don’t choose opus. Sonnet is dramatically cheaper.

What’s troubling is that you made a big deal about not hearing any stories of improvements as if your bar was very low for said stories, then immediately raised the bar when given them. It means that one doesn’t know what level of data you actually want.


Requesting concrete examples isn't a high bar. "Autopilot got better" tells me effectively nothing. "Autopilot can now handle stoplights" does.


“ but i cannot point to a single person who seems to think that they are accomplishing real-world tasks with GPT5 better than they were with GPT4.”

I don’t use OpenAI stuff but I seem to think Claude is getting better for accomplishing the real world tasks I ask of it.


Specifics are worth talking about. I just felt it unfair to complain about raising the bar when you didn’t initially reach it.

In your own words: “You asked for a single anecdote of llms getting better at daily tasks.”

Which is already less specific than their request: “i'd love to hear tangible ways it's actually better.”

Saying “is getting better for accomplishing the real world tasks I ask of it” brings nothing to a discussion and was the kind of vague statement that they were initially complaining about. If LLMs are really improving, it's not a major hurdle to say something meaningful about what specifically is getting better. /tilting at windmills


Here's one. I have a head to head "benchmark" involving generating a React web app to display a Gantt chart, add tasks, layer overlaps, read and write to files, etc. I compared implementing this application using both Claude Code with Opus 4.1 / Sonnet 4 (scenario 1) and Claude Code 2 with Sonnet 4.5 (scenario 2) head to head.

The scenario 1 setup could complete the application but it had about 3 major and 3 minor implementation problems. Four of those were easily fixed by pointing them out, but two required significant back and forth with the model to resolve.

The scenario 2 setup completed the application and there were four minor issues, all of which were resolved with one or two corrective prompts.

Toy program, single run through, common cases, stochastic parrot, yadda yadda, but the difference was noticeable in this direct comparison and in other work I've done with the model I see a similar improvement.

Take from that what you will.


so to clarify your case, you are having it generate a new application, from scratch, and then benchmarking the quality of the output and how fast it got to the solution you were seeking?

i will concede that in this arena, there does seem to be meaningful improvement.

i said this in one of my comments in this thread, but the place i routinely see the most improvement in output from LLMs (and find they perform best) for code generation is in green field projects, particularly ones whose development starts with an agent. some facts that make me side-eye this result (not yours in particular, just any benchmark that follows this model):

- the codebase, as long as a single agent and model are working on it, is probably suited to that model's biases and thus implicitly easier for it to work in and "understand."

- the codebase is likely relatively contained and simple.

- the codebase probably doesn't cross domains or require specialized knowledge of services or APIs that aren't already well-documented on the internet or weren't built by the tool.

these are definitely assumptions, but i'm fairly confident in their accuracy.

one of the key issues i've had approaching these agents is that all my "start with an LLM and continue" projects actually start incredibly impressively! i was pretty astounded even on the first version of claude code - i had claude building a service, web management interface AND react native app, in concert, to build an entire end to end application. it was great! early iteration was fast, particularly in the "mess around and find out what happens" phase of development.

where it collapsed, however, was when the codebase got really big, and when i started getting very opinionated about outcomes. my claude.md file grew and grew and seemed to enforce less and less behavior, and claude became less and less likely to successfully refactor or reuse code. this also tracks with my general understanding of what an LLM may be good or bad at - it can only hold so much context, and only as textual examples, not very effectively as concepts or mental models. this ultimately limits its ability to reason about complex architecture. it rapidly became faster for me to just make the changes i envisioned, and then claude became more of a refactoring tool that i very narrowly applied when i was too lazy to do the text wrangling myself.

i do believe that for rapid prototyping - particularly the case of "product manager trying to experiment and figure out some UX" - these tools will likely be invaluable, if they can remain cost effective.

the idea that i can use this, regularly, in the world of "things i do in my day-to-day job" seems a lot more far fetched, and i don't feel like the models have gotten meaningfully better at accomplishing those tasks. there's one notable exception of "explaining focused areas of the code", or as a turbo-charged grep that finds the area in the codebase where a given thing happens. i'd say that the roughly 60-70% success rate i see in those tasks is still a massive time savings to me because it focuses me on the right thing and my brain can fill in the rest of the gaps by reading the code. still, i wouldn't say its track record is phenomenal, nor do i feel like the progress has been particularly quick. it's been small, incremental improvements over a long period of time.

i don't doubt you've seen an improvement in this case (which is, as you admit, a benchmark) but it seems like LLMs keep performing better on benchmarks but that result isn't, as far as i can see, translating into improved performance on the day-to-day of building things or accomplishing real-world tasks. specifically in the case of GPT5, where this started, i have heard very little if any feedback on what it's better at that doesn't amount to "some things that i don't do." it is perfectly reasonable to respond to me that GPT5 is a unique flop, and other model iterations aren't as bad, in that case. i accept this is one specific product from one specific company - but i personally don't feel like i'm seeing meaningful evidence to support that assertion.


Thank you for the thoughtful response. I really appreciate the willingness to discuss what you've seen in your experience. I think your observations are pretty much exactly correct in terms of where agents do best. I'd qualify in just a couple areas:

1. In my experience, Claude Code (I've used several other models and tools, but CC performs the best for me so that's my go-to) can do well with APIs and services that are proprietary as long as there's some sort of documentation for them it can get to (internal, Swagger, etc.), and you ensure that the model has that documentation prominently in context.

2. CC can also do well with brownfield development, but the scope _has_ to be constrained, either to a small standalone program or a defined slice of a larger application where you can draw real boundaries.

The best illustration I've seen of this is in a project that is going through final testing prior to release. The original "application" (I use the term loosely) was a C# DLL used to generate data-driven prescription monitoring program reporting.

It's not ultra-complicated but there's a two step process where you retrieve the report configuration data, then use that data to drive retrieval and assembly of the data elements needed for the final report. Formatting can differ based on state, on data available (reports with no data need special formatting), and on whether you're outputting in the context of transmission or for user review.

The original DLL was written in a very simplistic way, with no testing and no way to exercise the program without invoking it from its link points embedded in our main application. Fixing bugs and testing those fixes were both very painful as for production release we had to test all 50 states on a range of different data conditions, and do so by automating the parent application.

I used Claude Code to refactor this module, add DI and testing, and add a CLI that could easily exercise the logic in all different supported configurations. It took probably $50 worth of tokens (this was before I had a Max account, so it was full price) over the course of a few hours, most of which time I was in other meetings.

The final result did exhibit some classic LLM problems -- some of the tests were overspecific, it restructured without always fully cleaning up the existing functions, and it messed up a couple of paths through the business logic that I needed to debug and fix. But it easily saved me a couple days of wrestling with it myself, as I'm not super strong with C#. Our development teams are fully committed, and if I hadn't used CC for this it wouldn't have gotten done at all. Being able to run this on the side and get a 90% result I could then take to the finish line has real value for us, as the improved testing alone will see an immediate payback with future releases.

This isn't a huge application by any means, but it's one example of where I've seen real value that is hitting production, and seems representative of a decently large category of line-of-business modules. I don't think there's any reason this wouldn't replicate on similarly-scoped products.


The biggest issue with Sonnet 4.5 is that it's chatty as fuuuck. It just won't shut up, it keeps producing massive markdown "reports" and "summaries" of every single minor change, wasting precious context.

With Sonnet 4 I rarely ran out of quota unexpectedly, but 4.5 chews through whatever little Anthropic gives us weekly.


GPT5 isn't an improvement to me, but Claude Sonnet 4.5 handles Terragrunt way, way better than the previous version did. It also goes and searches AWS documentation by itself, and parses external documents way better. That's not LLM improvement, to be clear (except the Terragrunt thing); I think it's improvement in data acquisition and a better inference engine. On React projects it seems way, way less messy too. I have to use it more, but the inference seems clearer, or at least less prone to circular code, where it gets stuck in a loop. It seems to exit the loop faster, even when the output isn't satisfactory (which isn't an issue for me; most of my prompts say more or less 'only write function templates, do not write the inside logic if it has to contain more than a loop', and I fill in the blanks myself).


I’m curious what you are expecting when you say progress has stagnated?


>> The marketing that enabled the capital, which enabled that scale, was what caused the insane growth, and capital can't grow forever,

Striking parallels between AI and food delivery (uber eats, deliveroo, lieferando, etc.) ... burn capital for market share/penetration but only deliver someone else's product with no investment to understand the core market for the purpose of developing a better product.


> I know I could be eating my words, but there is basically no evidence to suggest it ever becomes as exceptional as the kingmakers are hoping.

??? It has already become exceptional. In 2.5 years (since chatgpt launched) we went from "oh, look how cute this is, it writes poems and the code almost looks like python" to "hey, this thing basically wrote a full programming language[1] with genz keywords, and it mostly works, still has some bugs".

I think goalpost moving is at play here, and we quickly forget how much difference one year makes (last year you needed tons of glue and handwritten harnesses to do anything - see aider), while today you can give them a spec and get a mostly working project (albeit with some bugs), $50 later.

[1] - https://github.com/ghuntley/cursed


I don't disagree with you on the technology, but mostly my comment is about what the market is expecting. With such huge capex, it is expecting huge returns. Given that AI has not proven consistent ROI for other enterprises generally (as far as I know), they are hoping for something better than what exists right now, and they are hoping for it to happen before the money runs out.

I am not saying it's impossible, but there is no evidence either that the leap in technology needed to reach the wild profitability (replacing general labour) such investment demands is just around the corner.


After 3 years, I would like to see pathways.

Let's say we found a company that has already realized 5-10% savings in a first step. Now, based on this, we might be able to map out the path to 25-30% savings in 5% steps, for example.

I personally haven’t seen this, but I might have missed it as well.


Three years? One year ago I tried using LLMs for coding and found it to be more trouble than it was worth, with no benefit in time spent or effort made. It's only within the past several months that this has changed, IMHO.


To phrase this another way, using old terms: We seem to be approaching the uncanny valley for LLMs, at which point the market overall will probably hit the trough of disillusionment.


It doesn't really matter what the market is expecting at this point, the president views AI supremacy as non-negotiable. AI is too big to fail.


It’s true, but not just the presidency. The whole political class is convinced that this is the path out of all their problems.


...Is it the whole political class?

Or is it the whole political party?


I am not from the US, but your administration could still fumble the AI bust even if it wants to avoid it. Who knows maybe they are hoping to short it.


That there is a bubble is absolutely certain, if for no other reason than that investors don't understand the technology and don't know which companies are for real and which are essentially scams; they dump money into anything with the veneer of AI and hope some of it sticks. We're replaying the dotcom bubble, a lot of people are going to get burned, a lot of companies will turn out to be crap. But at the end of the dotcom crash we had some survivors standing above the rest and the whole internet thing turned out to have considerable staying power. I think the same will happen with AI, particularly agentic coding tools. The technology is real and will stick with us, even after the bubble and crash.


I feel like the invention of MCP was a lot more instrumental to that than model upgrades proper. But look at it as a good thing, if you will: it shows that even if models are plateauing, there's a lot of value to unlock through the tooling.


> it shows that even if models are plateauing,

The models aren't plateauing (see below).

> invention of MCP was a lot more instrumental [...] than model upgrades proper

Not clear. The folks at HF showed that a minimal "agentic loop" in 100 LoC [1] that gives the agent "just bash access" still gets very close to SotA setups with all the bells and whistles (and surpasses last year's models with handcrafted harnesses).

[1] - https://github.com/SWE-agent/mini-swe-agent
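
The core of such a loop really is tiny. A rough sketch of the idea in Python (not the actual mini-swe-agent code; call_model is a stand-in for whatever LLM client you use):

  # Minimal "just bash access" agent loop. This is a sketch of the idea only;
  # call_model() is a placeholder, not a real API.
  import subprocess

  def call_model(messages):
      # Plug in your LLM client here; should return the model's reply as a string.
      raise NotImplementedError

  def run_agent(task, max_steps=50):
      messages = [{"role": "user", "content": task +
                   "\nReply with exactly one bash command per turn, or DONE when finished."}]
      for _ in range(max_steps):
          reply = call_model(messages).strip()
          if reply == "DONE":
              break
          result = subprocess.run(reply, shell=True, capture_output=True, text=True)
          observation = (result.stdout + result.stderr)[-4000:]  # keep context small
          messages.append({"role": "assistant", "content": reply})
          messages.append({"role": "user", "content": observation})
      return messages

Everything else in the fancier harnesses is just refinements on feeding command output back into the next turn.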


Small focused (local) model + tooling is the future, not online LLMs with monthly costs. Your coding model doesn't need all of the information in the world built in, it needs to know code and have tools available to get any information it needs to complete its tasks. We have treesitter, MCPs, LSPs, etc - use them.

The problem is that all the billions (trillions?) of VC money go to the online models because they're printing money at this point.

There's no money to be made in creating models people can run locally for free.


I mean, that's still proving the point that tooling matters. I don't think his point was "MCP as a technology is extraordinary" because it's not.


MCP is a marketing ploy, not an “invention”.


It is an actual invention that has concrete function, whether or not it was part of a marketing push.


I didn't realize generating the gen-z programming language was a goalpost in the first place


The question in your last paragraph is not the only one that matters. Funding the technology at a material loss will not be off the table. Think about why.


Just tell us why you think funding at a loss at this scale is viable, don’t smugly assign homework


Apologies, not meant to be smug


...But you did fully intend to assign homework? Why are you even commenting, what are you adding?


I have had LLMs write entire codebases for me, so it's not like the hype is completely wrong. It's just that this only works if what you want is "boring", limited in scope and on a well-trodden path. You can have an LLM create a CRUD application in one go, or if you want to sort training data for image recognition you can have it generate a one-off image viewer with shortcuts tailored to your needs for this task. Those are powerful things and worthy of some hype. For anything more complex you very quickly run into limits and the time and effort to do it with an LLM quickly approaches the time and effort required to do it by hand.


They're powerful, but my feeling is that largely you could do this pre-LLM by searching on Stack Overflow or copying and pasting from the browser and adapting those examples, if you knew what you were looking for. Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

Of course, if you don't know what you are looking for, it can make that process much easier. I think this is why people at the junior end find it is making them (a claimed) 10x more productive. But people who have been around for a long time are more skeptical.


> Where it adds power is adapting it to your particular use case + putting it in the IDE. It's a big leap but not as enormous a leap as some people are making out.

To be fair, this is super, super helpful.

I do find LLMs helpful for search and providing a bunch of different approaches for a new problem/area though. Like, nothing that couldn't be done before but a definite time saver.

Finally, they are pretty good at debugging, they've helped me think through a bunch of problems (this is mostly an extension of my point above).

Hilariously enough, they are really poor at building MCP like stuff, as this is too new for them to have many examples in the training data. Makes total sense, but still endlessly amusing to me.


Why bother searching yourself? This is pre-LLM: https://github.com/drathier/stack-overflow-import


> Of course, if you don't know what you are looking for, it can make that process much easier.

Yes. My experience is that LLMs are really, really good at understanding what you are trying to say and bringing up the relevant basic information. That's a task we call "search", but it is different from the focused search people do most of the time.

Anyway, by the nature of the problem, that's something that people should do only a few times for each subject. There is not a huge market opportunity there.


Doing it the old fashioned lazy way, copy-pasting snippets of code you search for on the internet and slightly modifying each one to fit with the rest of your code, would take me hours to achieve the kind of slop that claude code can one shot in five minutes.

Yeah yeah, call me junior or whatever, I have thick skin. I'm a lazy bastard and I no longer care about the art of the craft, I just want programs tailored to my tastes and agentic coding tools are by far the fastest way to get it. 10x doesn't even come close, it's more like 100x just on the basis of time alone. Effort? After the planning stage I kick back with video games while the tool works. Far better than 100x for effort.


i have seen so many people say that, but the app stores/package managers aren't being flooded with thousands of vibe coded apps, meanwhile facebook is basically ai slop. can you share your github? or a gist of some of these "codebases"


You seem critical of people posting AI slop on Facebook (so am I) but also want people to publish more AI slop software?

The AI slop software I've been making with Claude is intended for my own personal use. I haven't read most of the code and certainly wouldn't want to publish it under my own name. But it does work, it scratches my itches, fills my needs. I'm not going to publish the whole thing because that's a whole can of worms, but to hopefully satisfy your curiosity, here is the main_window.py of my tag-based file manager. It's essentially a CRUD application built with sqlite and pyside6. It doesn't do anything terribly adventurous, the most exciting it gets is keeping track of tag co-occurrences so it can use naive Bayesian classifiers to recommend tags for files, order files by how likely they are to have a tag, etc.

Please enjoy. I haven't actually read this myself, only verified the behavior: https://paste.debian.net/hidden/c6a85fac
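
If you're curious, the co-occurrence trick is simpler than it sounds. A rough sketch of the idea (this is not the code from the paste; the structure and names here are invented for illustration):

  # Tag recommendation from co-occurrence counts, naive Bayes style.
  from collections import Counter, defaultdict
  from math import log

  tag_counts = Counter()            # how many files carry each tag
  cooccur = defaultdict(Counter)    # cooccur[a][b]: files tagged with both a and b

  def observe(file_tags):
      for a in file_tags:
          tag_counts[a] += 1
          for b in file_tags:
              if a != b:
                  cooccur[a][b] += 1

  def suggest(existing_tags, k=5):
      scores = {}
      for cand in tag_counts:
          if cand in existing_tags:
              continue
          # log P(cand) + sum of log P(t | cand), with add-one smoothing
          score = log(tag_counts[cand])
          for t in existing_tags:
              score += log((cooccur[cand][t] + 1) / (tag_counts[cand] + 1))
          scores[cand] = score
      return sorted(scores, key=scores.get, reverse=True)[:k]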

> "the app stores/package managers aren't being flooded with thousands of vibe coded apps"

The state of claude code presently is definitely good enough to churn out low effort shovelware. Insofar as that isn't evidently happening, I can only speculate about the reasons. In no particular order, it may be one or several of these:

- Lots of developers feel threatened by the technology and won't give it a serious whirl.

- Non-developers are still stuck in the mindset of writing software being something they can't do.

- The general public isn't as aware of the existence of agentic coding tools as we on HN are.

- The app stores are being flooded with slop, as they always have been; some of that slop is now AI slop but doesn't advertise the fact, and the app store algorithms generally do some work to suppress the visibility of slop anyway.

- Most people don't have good ideas for new software and don't have the reflex to develop new software to scratch their itches; instead they are stuck in the mentality of software consumers.

Just some ideas.


It’s hardly slop when you have over 100 different sources referenced in a targeted paper.


> Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Which makes sense, considering the absolutely massive amount of tutorials and basic HOWTOs that were present in the training data, as they are the easiest kind of programming content to produce.


The purpose of an LLM is not to do your job, it's to do enough to convince your boss to sack you and pay the LLM company some portion of your salary.

To that end, it doesn't matter if it works or not, it just has to demo well.


> Current state-of-the-art models are, in my experience, very good at writing boilerplate code or very simple architecture especially in projects or frameworks where there are extremely well-known opinionated patterns (MVC especially).

Yes, kind of. What you downplay as "extremely well-known opinionated patterns" actually means standard design patterns that are well established and tried-and-true. You know, what competent engineers do.

There's even a basic technique that consists of prompting agents to refactor code so it complies with best practices, as this helps agents reason about your project by lining it up with known patterns.

> What they are genuinely impressive at is parsing through large amounts of information to find something (eg: in a codebase, or in stack traces, or in logs).

Yes, they are. It helps if a project is well structured, clean, and follows best practices. Messy projects that are inconsistent and evolve as big balls of mud can and do lead LLMs to output garbage based on the garbage that was inputted. Once, while working on a particularly bad project, I noticed GPT4.1 wasn't even managing to put together consistent variable names for domain models.

> But this hype machine of 'agents creating entire codebases' is surely just smoke and mirrors - at least for now.

This really depends on what your expectations are. A glass-half-full perspective clearly points you to the fact that yes, agents can and do create entire codebases. I know this to be a fact because I did it already just for shits and giggles. A glass-half-empty perspective, however, will lead people to nitpick their way into asserting agents are useless at creating code because they once prompted something to create a Twitter clone and it failed to set the right shade of blue. YMMV, and what you get out is proportional to the effort you put in.


What is novel code?

  1. LLMs would suck at coming up with new algorithms.
  2. I wouldn't let an LLM decide how to structure my code. Interfaces, module boundaries etc
Other than that, given the right context (the SDK docs for a unique piece of hardware, for example) and a well organised codebase explained using CLAUDE.md, they work pretty well in filling out implementations. Just need to resist the temptation to prompt when the actual typing would take seconds.


Yep, LLMs are basically at the "really smart intern" level. Give them anything complex or that requires experience and they crash and burn. Give them a small, well-specified task with limited scope and they do reasonably well. And like an intern they require constant check-ins to make sure they're on track.

Of course with real interns you end up at the end with trained developers ready for more complicated tasks. This is useful because interns aren't really that productive if you consider the amount of time they take from experienced developers, so the main benefit is producing skilled employees. But LLMs will always be interns, since they don't grow with the experience.


My experience is opposite to yours. I have had Claude Code fix issues in a compiler over the last week with very little guidance. Occasionally it gets frustrating, but most of the time Claude Code just churns through issue after issue, fixing subtle code generation and parser bugs with very little intervention. In fact, most of my intervention is down to tool weaknesses around managing compaction to avoid running out of context at inopportune moments.

It's implemented methods I'd have to look up in books to even know about, and shown that it can get them working. It may not do much truly "novel" work, but very little code is novel.

They follow instructions very well if structured right, but you can't just throw random stuff in CLAUDE.md or similar. The biggest issue I've run into recently is that they need significant guidance on process. My instructions tend to focus on three separate areas: 1) debugging guidance for a given project (for my compiler project, that means things like "here's how to get an AST dumped from the compiler" and "use gdb to debug crashes"; it sometimes did that without being told, but not consistently - with the instructions it usually does), 2) acceptance criteria - this does need reiteration, 3) telling it to run tests frequently, make small, testable changes, and to frequently update a detailed file outlining the approach to be taken, progress towards it, and any outcomes of investigation during the work.
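
For concreteness, a rough sketch of the shape those instructions take (the contents here are invented for illustration, not copied from a real project; adapt the commands and file names to your own setup):

  ## Debugging (project-specific)
  - To inspect the AST, run `./compiler --dump-ast <file>` (flag name is made up here).
  - Use gdb to investigate crashes before guessing at fixes.

  ## Acceptance criteria
  - A change is only done when the full regression test suite passes.

  ## Process
  - Make small, testable changes and run the tests after each one.
  - Keep PLAN.md (or whatever file you choose) updated with the approach, progress, and investigation notes.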

My experience is that with those three things in place, I can have Claude run for hours with --dangerously-skip-permissions and only step in to say "continue" or do a /compact in the middle of long runs, with only the most superficial checks.

It doesn't always provide perfect code every step. But neither do I. It does however usually move in the right direction every step, and has consistently produced progress over time with far less effort on my behalf.

I wouldn't have it start from scratch without at least some scaffolding that is architecturally sound yet, but it can often do that too, though that needs review before it "locks in" a bad choice.

I'm at a stage where I'm considering harnesses to let Claude work on a problem over the course of days without human intervention instead of just tens of minutes to hours.


> My experience is opposite to yours.

But that is exactly the problem, no?

It is like, when you need some prediction (e.g. about market behavior), knowing that somewhere out there there is a person who will make the perfect one. However, instead of your problem being to make the prediction, it is now to find and identify that expert. Is the type of problem you converted yours into any less hard, though?

I too had some great minor successes; the current products are definitely a great step forward. However, every time I start anything more complex, I never know in advance whether I will end up with utterly unusable code, even after corrections (with the "AI" always confidently claiming that now it has definitely fixed the problem), or something usable.

All those examples such as yours suffer from one big problem: They are selected afterwards.

To be useful, you would have to make predictions in advance and then run the "AI" and have your prediction (about its usefulness) verified.

Selecting positive examples after the work is done is not very helpful. All it does is prove that at least sometimes somebody gets something useful out of using an LLM for a complex problem. Okay? I think most people understand that by now.

PS/Edit: Also, success stories we only hear about but cannot follow and reproduce may have been somewhat useful initially, but by now most people are beyond that, willing to give it a try, and would like to have a link to the working and reproducible example. I understand that work can rarely be shared, but then those examples are not very useful any more at this point. What would add real value for readers of these discussions now is when people who say they were successful posted the full, working, reproducible example.

EDIT 2: Another thing: I see comments from people who say they did tweak CLAUDE.md and got it to work. But the point is predictability and consistency! If you have that one project where you twiddled around with the file and added random sentences that you thought could get the LLM to do what you need, that's not very useful. We already know that trying out many things sometimes yields results. But we need predictability and consistency.

We are used to being able to try stuff, and when we get it working we could almost always confidently say that we found the solution, and share it. But LLMs are not that consistent.


My point is that these are not minor successes, and not occasional. Not every attempt is equally successful, but a significant majority of my attempts are. Otherwise I wouldn't be letting it run for longer and longer without intervention.

For me this isn't one project where I've "twiddled around with the file and added random sentences". It's an increasingly systematic way of giving it an approach to making changes, giving it regression tests, and making it make small, testable changes.

I do that because I can predict with a high rate of success that it will achieve progress for me at this point.

There are failures, but they are few, and they're usually fixed simply by starting it over again from after the last successful change when it takes too long without passing more tests. Occasionally it requires me to turn off --dangerously-skip-permissions and guide it through a tricky part. But that is getting rarer and rarer.

No, I haven't formally documented it, so it's reasonable to be skeptical (I have however started packaging up the hooks and agents and instructions that consistently work for me on multiple projects. For now, just for a specific client, but I might do a writeup of it at some point) but at the same time, it's equally warranted to wonder whether the vast difference in reported results is down to what you suggest, or down to something you're doing differently with respect to how you're using these tools.


replace 'AI|LLM' with 'new hire' in your post for a funny outcome.


Replace 'new hire' with 'AI|LLM' in the updated post for a very sad outcome.


New hires perform consistently. Even if you can't predict beforehand how well they'll work, after a short observation time you can predict very well how they will continue to work.


this is the first time I've ever seen this joke, well done!


You are using the wrong tools if you are getting crappy results. It’s like editing a photo with notepad, it’s possible but likely to fail.


I had a highly repetitive task (/subagents is great to know about), but I didn't get more advanced than a script that sent "continue\n" into the terminal where CC was running every X minutes. What was frustrating is that CC was inconsistent with how long it would run. Needing to compact was a bit of a curveball.
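
For anyone wanting to do the same, the script really was nothing more than something like this (a sketch; it assumes CC is running inside a tmux session, and the session name and interval are made up):

  # Nudge a Claude Code session every X minutes via tmux send-keys.
  import subprocess, time

  SESSION = "cc"          # tmux session where CC is running (example name)
  INTERVAL = 15 * 60      # seconds between nudges (example)

  while True:
      time.sleep(INTERVAL)
      subprocess.run(["tmux", "send-keys", "-t", SESSION, "continue", "Enter"])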


The compaction is annoying, especially when it sometimes will then fail to compact with an error, forcing rewinding. They do need to tighten that up so it doesn't need so much manual intervention...


if claude generates the tests, runs those tests, applies the fixes without any oversight, it is a very "who watches the watchmen" situation.


That is true, so don't give it entirely free rein with that. I let Claude generate as many additional tests as it'd like, but I either produce high level tests, or review a set generated by Claude first, before I let it fill in the blanks, and it's instructed very firmly to see a specific set of test cases as critical, and then increasingly "boxed in" with more validated test cases as we go along.

E.g. for my compiler, I had it build scaffolding to make it possible to run rubyspecs. Then I've had it systematically attack the crashes and failures mostly by itself once the test suite ran.


If you generate the tests, run those tests, apply fixes without any oversight, it is the very same situation. In reality, we have PR reviews.


Is it? Stuff like ripgrep, msmtp,… are very much one-man projects. And most packages on distros are maintained by only one person. Expertise is a thing, and getting reliable results is what differentiates experts from amateurs.


Gemini?


Good lord, that would be like the blind leading the daft.


> brands injecting themselves into conversations on Reddit, LinkedIn, and every other public forum.

Don't forget HackerNews.

Every single new release from OpenAI and other big AI firms attracts a lot of new accounts posting surface-level comments like "This is awesome" and then a few older accounts that have exclusively posted on previous OpenAI-related news to defend them.

It's glaringly obvious, and I wouldn't be surprised if at least a third of the comments on AI-related news is astroturfing.


I personally always love the “I wrote an entire codebase with claud” posts where the response to “Can we see it?” is either the original poster disappearing into the mist until the next AI thread or “no I am under an NDA. My AI-generated code is so incredible and precious that my high-paying job would be at risk for disclosing it”


If anyone actually believed those requests to see code were sincere, or if they at least generated interesting discussion, people might actually respond. But the couple of times I've linked to a blog post someone wrote about their vibe-coding experience in the comments, someone invariably responds with an uninteresting shallow dismissal shitting all over the work. It didn't generate any interesting discussion, so I stopped bothering.

https://mitchellh.com/writing/non-trivial-vibing went round here recently, so clearly LLMs are working in some cases.


And I think, in this blog post, the author stated that he does heavy editing of what’s generated. So I don’t know how much time is saved actually. You can get the same kind of inspiration from docs, books, or some SO answer.


Haters gonna hate, but the haters aren't always wrong. If you just want people to agree with you, that's not a discussion.


And they are usually 10x more productive as well!


An NDA on AI-generated code is funny, since model outputs are technically someone else's code. It's amazing how we're infusing all kinds of systems with potential license violations.


Someone posted these single file examples: https://github.com/joaopauloschuler/ai-coding-examples/tree/...


Honestly, I've generated some big-ish codebases with AI and have said so, then backed off when asked, because a) I still want to establish more confidence in the codebase, and b) my employment contract gleefully states that everything I write belongs to my employer. Both of those things make me nervous.

That said, I have no doubt there are also bots setting out to generate FOMO


Everything you wrote belongs to them. But it's not you who wrote it; Claude is the author.


Sam Altman would agree with you that those posts are bots and lament it, but would simultaneously remain (or pretend to be?) absurdly oblivious to his own role in creating that situation.

https://techcrunch.com/2025/09/08/sam-altman-says-that-bots-...


> "This is awesome"

Or the "I created 30 different .md instruction files and AI model refactored/wrote from scratch/fixed all my bugs" trope.

> a third of the comments on AI-related news are astroturfing.

I wouldn't be surprised if it's even more than that. And, ironically, the astroturfing is probably aided by the capability of said models to spew out text...


> Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books converted to markdown, in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's a one-time setup cost.

Telling LLMs which files they should look at was indeed necessary 1-2 years ago in early models, but I have not done that for the last half year or so, and I'm working on codebases with millions of lines of code. I've also never had modern LLMs use nonexistent libraries. Sometimes they try to use outdated libraries, but it fails very quickly once they try to compile and they quickly catch the error and follow up with a web search (I use a custom web search provider) to find the most appropriate library.

I'm convinced that anybody who says that LLMs don't work for them just doesn't have a good mental model of HOW LLMs work, and thus can't use them effectively. Or their experience is just outdated.

That being said, the original issue that they don't always follow instructions from CLAUDE/AGENT.md files is quite true and can be somewhat annoying.


> Not my experience. I've used LLMs to write highly specific scientific/niche code and they did great, but obviously I had to feed them the right context (compiled from various websites and books converted to markdown, in my case) to understand the problem well enough. That adds additional work on my part, but the net productivity is still very much positive because it's a one-time setup cost.

Which language are you using?


Rust, Python, and a bit of C++. Around 80% Rust probably


I've been genuinely surprised how well GPT-5 does with Rust! I've done some hairy stuff with Tokio/Arena/SIMD that I thought I would have to hand-hold it through, and it got it.


Yeah, it has been really good in my experience. I've done some niche WASM stuff with custom memory layouts and parallelism and it did great there too, probably better than I could've done without spending several hours reading up on stuff.


It's pretty good at Rust, but it doesn't understand locking. When I tried it, it just put a lock on everything and then didn't take care to make sure the locks were released as soon as possible. This severely limited the scalability of the system it produced.

But I guess it passed the tests it wrote, so, win? Though it didn't seem to understand why the test it wrote where the client used TLS and the server didn't wouldn't pass, and it required a lot of hand-holding along the way.
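
To illustrate the locking point, this is roughly the shape of the problem (a contrived sketch, not the actual code it produced): the guard lived for the whole function, when scoping it would have released the lock before the slow work.

    use std::sync::Mutex;
    use std::thread::sleep;
    use std::time::Duration;

    struct Stats {
        counter: Mutex<u64>,
    }

    // The problematic shape: the guard lives until the end of the function,
    // so every other thread also waits out the slow work.
    fn record_slow(stats: &Stats) {
        let mut count = stats.counter.lock().unwrap();
        *count += 1;
        do_expensive_io(); // lock is still held here
    }

    // The fix: scope the guard so the lock is released before the slow part.
    fn record_fast(stats: &Stats) {
        {
            let mut count = stats.counter.lock().unwrap();
            *count += 1;
        } // guard dropped, lock released
        do_expensive_io();
    }

    fn do_expensive_io() {
        sleep(Duration::from_millis(50)); // stand-in for real I/O
    }

    fn main() {
        let stats = Stats { counter: Mutex::new(0) };
        record_slow(&stats);
        record_fast(&stats);
        println!("count = {}", stats.counter.lock().unwrap());
    }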


I've experienced similar things, but my conclusion has usually been that the model is not receiving enough context in such cases. I don't know your specific example, but in general it may not be incorrect to put an Arc/Lock on many things at once (or to use Arc instead of Rc, etc.) if your future plans are to parallelize several parts of your codebase. The model just doesn't know what your future plans are, and it errs on the side of "overengineering" solutions for all kinds of future possibilities. I found that this is a bias that these models tend to have: many times their code is overengineered for features I will never need and I need to tell them to simplify - but that's expected. How would the model know what I do and don't need in the future without me giving it all the right context?

The same thing is true for tests. I found their tests to be massively overengineered, but that's easily fixed by telling them to adopt the testing style from the rest of the codebase.


Rust has been an outlier in my experience as well. I have a pet theory that it's because Rust code that's been pushed to GitHub generally compiles. And if it compiles, it generally works.


Coding with Claude feels like playing a slot machine. Sometimes you get more or less what you asked for, sometimes totally not. I don't think it's wise or sane to leave them unattended.


If you spend most of your time in planning mode, that helps considerably. It will almost always implement whatever it is that you planned together, so if you're willing to plan extensively enough you'll more or less know what you're going to get out of it when you finally set it loose.


You are absolutely right!


That was a very robust and comprehensive comment


Yes, and I think a lot of people are addicted to gambling. The dizzying highs when you win cloud out the losses. Even when you're down overall.


I found that using Opus helps a lot. It's eye-wateringly expensive though, so I generally avoid it. I pay through API calls because I don't tend to code much.


Genuinely interesting how divergent people's experiences of working with these models are.

I've been 5x more productive using codex-cli for weeks. I have no trouble getting it to convert a combination of unusually structured source code and internal SVGs of execution traces into a custom internal JSON graph format - very clearly out-of-domain tasks compared to their training data. Or mining a large mixed Python/C++ codebase, including low-level kernels for our RISC-V accelerators, for ever more accurate docs, to the level of documenting bugs as known issues that the team ran into the same day.

We are seeing wildly different outcomes from the same tools and I'm really curious about why.


You are asking it to do what it already knows how to do, by feeding it what it needs in the prompt.


How did you measure your 5x productivity gain? How did you measure the accuracy of your docs?


Translation is not creation.


But genuinely, how many people are "creating", as in truly novel stuff that someone hasn't thought of before?

I'd wager a majority of software engineers today are using techniques that are well established... that most models are trained on.

Most current creation (IMHO) comes from wielding existing techniques in different combinations, which I'd wager is very much possible with LLMs.


> and it adds _some_ value by thinking of edge cases I might’ve missed, best practices I’m unaware of, and writing better grammar than I do.

This is my most consistent experience. It is great at catching the little silly things we do as humans. As such, I have found them to be most useful as PR reviewers whose feedback you take with a pinch of salt.


> It is great at catching the little silly things we do as humans.

It's great some of the time. But the great draw of computing was that it would always catch the silly things we do as humans.

If it didn't, we'd change the code, and the next time (and forever onward) it would catch that case too.

Now we're playing whack-a-mole, pleading with words like "CRITICAL" and bold text in our .cursorrules to try and make the LLM pay attention. Maybe it works today; it might not work tomorrow.

Meanwhile the C-suite pushing these tools onto us still happily blame the developers when there's a problem.


> It's great some of the time. But the great draw of computing was that it would always catch the silly things we do as humans.

People are saying that you should write a thesis-length file of rules, and they're the same people balking at programming language syntax and formalism. Tools like linters, test runners, and compilers are reliable in the sense that you know exactly where the guardrails are and where to focus mentally to solve an issue.


This repo [1] is a brilliant illustration of the copium going into this.

Third line of the Claude prompt [2]:

IMPORTANT: You must NEVER generate or guess URLs for the user - Who knew solving LLM hallucinations was just that easy?

IMPORTANT: DO NOT ADD ***ANY*** COMMENTS unless asked - Guess we need triple bold to make it pay attention now?

It gets even more ludicrous when you see the recommendation that you should use an LLM to write this slop of a .cursorrules file for you.

[1]: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...

[2]: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...


I'm shocked that this isn't talked about more. The pro-AI astroturfing done everywhere (well, HN and Reddit anyway) is out of this world.


> we know that creating CLAUDE.md or cursorrules basically does nothing

While I agree, the only cases where I actually created something barely resembling useful (while still of subpar quality) were after putting lines like these into CLAUDE.md:

YOUR AIM IS NOT TO DELIVER A PROJECT. YOU AIM IS TO DO DEEP, REPETITIVE E2E TESTING. ONLY E2E TESTS MATTER. BE EXTREMELY PESSIMISTIC. NEVER ASSUME ANYTHING WORKS. ALWAYS CHECK EVERY FEATURE IN AT LEAST THREE DIFFERENT WAYS. USE ONLY E2E TESTS, NEVER USE OTHER TYPES OF TEST. BE EXTREMELY PESSIMISTIC. NEVER TRUST ANY CODE UNLESS YOU DEEPLY TEST IT E2E

REMEMBER, QUICK DELIVERY IS MEANINGLESS, IT'S NOT YOUR AIM. WORK VERY SLOWLY, STEP BY STEP. TAKE YOUR TIME AND RE-VERIFY EACH STEP. BE EXTREMELY PESSIMISTIC

With this kind of setup, it kind of attempts to work in a slightly different way than it normally does and is able to build some very basic stuff, although frankly I'd do it much better myself, so I'm not sure about the economics here. Maybe for people who don't care, or who won't be maintaining the code, it doesn't matter, but personally I'd never use it in my workplace.


My cynical working theory is this kind of thing basically never works but sometimes it just happens to coincide with useful code.


omg imagine giving these instructions to a junior developer to accompany his task.


>BE EXTREMELY PESSIMISTIC. NEVER ASSUME ANYTHING WORKS.

search for compiler bugs


Too much money was invested, it needs to be sold.


> Beyond this, if you’re working on novel code, LLMs are absolutely horrible at doing anything. A lot of assumptions are made, non-existent libraries are used, and agents are just great at using tokens to generate no tangible result whatsoever.

That's not my experience at all. A basic prompt file is all it takes to cover any assumption you leave out of your prompts. Nowadays the likes of Copilot even provide out-of-the-box support for instruction files, and you can create them with an LLM prompt too.
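
For example, a minimal sketch of such a file (this assumes Copilot's repo-wide .github/copilot-instructions.md convention; the project specifics are made up):

    # .github/copilot-instructions.md
    This is a Rust workspace. Run `cargo test --workspace` before declaring a task done.
    Reuse the error type in crates/common/src/error.rs; do not introduce new error enums.
    Never add third-party dependencies without asking first.
    Match the existing testing style: integration tests under crates/*/tests/, no mocks.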

Sometimes I wonder what first-hand experience the most vocal LLM haters out here actually have. They seem to talk an awful lot about issues that feel artificial and not grounded in reality. It's like we are discussing how great riding a bicycle is, and these guys start ranting about how the bike industry is in a bubble because they can't even manage to stay up with the training wheels on. I mean, have you bothered to work on the basics?


Nailed it. The other side of the marketing hype cycle will be saner, once market forces sort the wheat from the chaff.


There's more money to be made right now in selling courses than in actually using the LLMs well. So these guys pretend that they've found all these ways to make agents, they market it, and people buy the course.


> On the ground, we know that creating CLAUDE.md or cursorrules basically does nothing.

I don't agree with this. LLMs will go out of their way to follow any instruction they find in their context.

(E.g. I have "I love napkin math" in my Kagi Agent Context, and every LLM will try to shoehorn some kind of napkin math into every answer.)

Cursor and Co do not follow these instructions because they:

(a) never make it into the context in the first place, or (b) fall out of the context window.


My experience is kind of the opposite of what you describe (working in big tech). Like, I'm easily hitting 10x levels of output nowadays, and it's purely enabled by agentic coding. I don't really have an answer for why everyone's experience is so different - but we should be careful not to paint our personal experience with AI in broad strokes: "everyone knows AI is bad" - nope!

What I suspect is that it _heavily_ depends on the quality of the existing codebase and how easy the language is to parse. Languages like C++ really hurt the agent's ability to do anything, unless you're using a very constrained version of it. Similarly, spaghetti codebases which do stupid stuff like asserting true/false in tests with poor error messages, and that kind of thing, also cause the agents to struggle.

Basically: the simpler your PL and codebase, and the better the error and debugging messages, the easier it is to be productive with the AI agents.
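
As a contrived sketch of the assertion point (nothing from a real codebase): the second failure message gives an agent something to act on, while the first just says something broke.

    // Hypothetical parse() only so the example is self-contained.
    fn parse(input: &str) -> Result<i64, String> {
        input.trim().parse::<i64>().map_err(|e| format!("not a number: {e}"))
    }

    // On failure this prints only: assertion failed: parse("2+2").is_ok()
    #[test]
    fn opaque_assertion() {
        assert!(parse("2+2").is_ok());
    }

    // On failure this also prints the actual error, so the agent can see
    // *why* it failed without rerunning anything under a debugger.
    #[test]
    fn descriptive_assertion() {
        let result = parse("2+2");
        assert!(result.is_ok(), "parse(\"2+2\") failed with: {:?}", result.err());
    }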



