
I do wonder what the moat is around this class of products (call it "coding agents").

My intuition is that it's not deep... the differentiating factor is "regular" (non LLM) code which assembles the LLM context and invokes the LLM in a loop.

Claude/Codex have some advantage, because they can RLHF/finetune better than others. But ultimately this is about context assembly and prompting.


There is no moat. It's all prompts. The only potential moat, I believe, is building your own specialized models using the code your customers send your way.

I think "prompts" are a much richer kind of intellectual property than they are given credit for. Will put in here a pointer to the Odd Lots recent podcast with Noetica AI- a give to get M&A/complex debt/deal terms benchmarker. Noetica CEO said they now have over 1 billion "deal terms" in their database, which is only half a dozen years old. Growing constantly. Over 1 billion different legal points on which a complex financial contract might be structured. Even more than that, the representation of terms in contracts they see can change pretty dramatically quarter to quarter. The industry learns and changes.

The same thing is going to happen with all of the human-language artifacts in the agentic coding universe. Role definitions, skills, agentic-loop prompts... the specific language, choice of words, sequence, etc. really matter and will continue to evolve very rapidly, and there will be benchmarkers, I am sure of it, because quite a lot of orgs will consider their prompt artifacts to be IP.

I have personally found that a very high precision prompt will mean a smaller model on personal hardware will outperform a lazy prompt given to a foundation model. These word calculators are very very (very) sensitive. There will be gradations of quality among those who drive them best.

The best law firms are the best because they hire the best with (legal) language and are able to retain the reputation and pricing of the best. That is the moat. Same will be the case here.


But the problem is the tight coupling of prompts to the models. The half-life of prompt value is short because new models arrive so frequently; how do you defend a moat that can halve (or worse) any day a new model comes out?

You might get an 80% "good enough" prompt easily, but then all the differentiation (the moat) is in that remaining 20%, and that 20% is tied to model idiosyncrasies, making the moat fragile and volatile.


I think the issue was that they (the parent commenter) didn't properly convey, and/or did not realize, that they were arguing for context. Data that is difficult to come by and that can be used in a prompt is valuable. Being able to work around something with clever wording (i.e., a prompt) is not a moat.

If you provide it a benchmark script (or ask it to write one) so it has concrete numbers to go off of, it will do a better job.

I'm not saying these things don't hallucinate constantly, they do. But you can steer them toward better output by giving them better input.
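
By "benchmark script" I mean something as simple as the toy sketch below; `baseline` and `candidate` are placeholders for whatever implementations the agent is actually comparing.

  # Toy benchmark the agent can run to get hard numbers instead of guessing.
  # `baseline` and `candidate` are placeholders for the implementations under test.
  import timeit

  def baseline(n: int) -> int:
      return sum(i * i for i in range(n))

  def candidate(n: int) -> int:
      total = 0
      for i in range(n):
          total += i * i
      return total

  def bench(fn, n=100_000, repeat=5, number=10) -> float:
      # best-of-`repeat` timing, averaged per call
      return min(timeit.repeat(lambda: fn(n), repeat=repeat, number=number)) / number

  if __name__ == "__main__":
      for fn in (baseline, candidate):
          print(f"{fn.__name__}: {bench(fn) * 1e3:.3f} ms per call")

Pasting that output back into the context gives the model something concrete to optimize against instead of vibes.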


Super interesting data.

I do question this finding:

> the small model category as a whole is seeing its share of usage decline.

It's important to remember that this data is from OpenRouter... an API service. Small models are exactly those that can be self-hosted.

It could be the case that total small model usage has actually grown, but people are self-hosting rather than using an API. OpenRouter would not be in a position to determine this.


Thank you & totally agree! The findings are purely observational through OpenRouter’s lens, so they naturally reflect usage on the platform, not the entire ecosystem.

Yeah, using an API aggregator to run a 7B model is economically strange if you have even a consumer GPU. OpenRouter captures the cream of complex requests (Claude 3.5, o1) that you can't run at home. But even for local hosting, medium models are starting to displace small ones because quantization lets you run them on accessible hardware, and the quality boost there is massive. So the "Medium is the new Small" trend likely holds true for the self-hosted segment as well.

While it is possible to self-host small models, it is not easy to host them at high speeds. Many small-model use cases involve large batches of work (processing large volumes of documents, agentic workflows, ...), and for those, using a provider with high tokens-per-second numbers is well motivated.

Still, I agree that self-hosting is probably a part of the decrease.


The bigger issue is that they classify "small" by a fixed total parameter count rather than by active parameters for MoE models, and they don't account for any hardware improvements, etc. If they counted "small" based on price or computational cost, I think they would have seen an increase in small models.

I think using total parameters is fair; it correlates well with the RAM required to run the model. Otherwise Kimi K2 would be "small" despite having a trillion parameters!
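
Rough numbers to make that concrete (a back-of-the-envelope sketch, not from the report; the Kimi K2 active-parameter figure is approximate):

  # Back-of-the-envelope weight memory: total parameters gate whether the model
  # fits in (V)RAM at all; active parameters mostly determine per-token compute.
  def weight_gb(params_billion: float, bits_per_weight: int) -> float:
      return params_billion * 1e9 * bits_per_weight / 8 / 1e9

  for name, total_b, active_b in [
      ("dense 7B", 7, 7),
      ("MoE, ~1T total / ~32B active (Kimi-K2-class)", 1000, 32),
  ]:
      print(f"{name}: ~{weight_gb(total_b, 16):.0f} GB at fp16, "
            f"~{weight_gb(total_b, 4):.0f} GB at 4-bit, "
            f"~{active_b}B params active per token")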

VRAM doesn't matter if you are using an API. Price and performance are what matter.

Interesting that this talks about people in tech who hate AI; it's true, tech seems actually fairly divided with respect to AI sentiment.

You know who's NOT divided? Everyone outside the tech/management world. Antipathy towards AI is extremely widespread.


And yet there are multiple posts ITT (obviously from tech-oriented people) proclaiming that large swaths of the non-tech world love AI.

An opinion I've personally never encountered in the wild.


I think they exist as a "market segment" (i.e, there are people out there who will use AI), but in terms of how people talk about it, sentiment is overwhelmingly negative in most circles. Especially folks in the arts and humanities.

The only non-technical people I know who are excited about AI, as a group, are administrator/manager/consultant types.


There are two possible explanations for this behavior: the model nerf is real, or there's a perceptual/psychological shift.

However, benchmarks exist. And I haven't seen any empirical evidence that the performance of a given model version grows worse over time on benchmarks (in general.)

Therefore, some combination of two things is true:

1. The nerf is psychological, not actual.
2. The nerf is real, but in a way that is perceptible to humans and not to benchmarks.

#1 seems more plausible to me a priori, but if you aren't inclined to believe that, you should be positively intrigued by #2, since it points towards a powerful paradigm shift of how we think about the capabilities of LLMs in general... it would mean there is an "x-factor" that we're entirely unable to capture in any benchmark to date.


There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-....

The real issue is that there is no reliable system currently in place for the end user (other than being willing to burn the cash and run your own benchmarks regularly) to detect changes in performance.

It feels to me like a perfect storm. The combination of high inference costs, extreme competition, and the statistical nature of LLMs makes it very tempting for a provider to tune their infrastructure in order to squeeze more volume from their hardware. I don't mean to imply bad-faith actors: things are moving at breakneck speed and people are trying anything to see what sticks. But the problem persists; people are building on systems that are in constant flux (for better or for worse).


> There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-...

There was one well-documented case of performance degradation which arose from a stupid bug, not some secret cost cutting measure.


I never claimed that it was being done in secrecy. Here is another example: https://groq.com/blog/inside-the-lpu-deconstructing-groq-spe....

I have seen multiple people mention openrouter multiple times here on HN: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...

Again, I'm not claiming malicious intent. But model performance depends on a number of factors and the end-user just sees benchmarks for a specific configuration. For me to have a high degree of confidence in a provider I would need to see open and continuous benchmarking of the end-user API.


All those are completely irrelevant. Quantization is just a cost optimization.

People are claiming that Anthropic et al. change the quality of the model after the initial release, which is entirely different and which the industry as a whole has denied. When a model is released under a certain version, the model doesn't change.

The only people who believe this are in the vibe coding community, believing that there’s some kind of big conspiracy, but any time you mention “but benchmarks show the performance stays consistent” you’re told you’re licking corporate ass.


I might be misunderstanding your point, but quantization can have a dramatic impact on the quality of the model's output.

For example, in diffusion, there are some models where a Q8 quant dramatically changes what you can achieve compared to fp16. (I'm thinking of the Wan video models.) The point I'm trying to make is that it's a noticeable model change, and can be make-or-break.


Of course, no one is debating that. What’s being debated is whether this is done after a model’s initial release, eg Anthropic will secretly change the new Opus model to perform worse but be more cost efficient in a few weeks.


> some secret cost cutting measure

That’s not the point; it’s just a day in the life of ops to tweak your system to improve resource utilization and performance, which can cause bugs you don’t expect in LLMs. It’s a lot easier to monitor performance in a deterministic system, but harder to see the true impact a change has on an LLM.


https://www.youtube.com/watch?v=DtePicx_kFY

"There's something still not quite right with the current technology. I think the phrase that's becoming popular is 'jagged intelligence'. The fact that you can ask an LLM something and they can solve literally a PhD level problem, and then in the next sentence they can say something so clearly, obviously wrong that it's jarring. And I think this is probably a reflection of something fundamentally wrong with the current architectures as amazing as they are."

Llion Jones, co-inventor of transformers architecture


There is something not right with expecting that artificial intelligence will have the same characteristics as human intelligence. (I am responding to the quote.)


I think he's commenting more on the inconsistency of it, rather than the level of intelligence per se.


This. I keep repeating to people to stick to very specific questions with very specific limits and expectations, but no... give me 20 pages of PhD-level text that finds a cure for cancer.


The previous “nerf” was actually several bugs that dramatically decreased performance for weeks.

I do suspect continued fine-tuning lowers quality: the stuff they roll out for safety/jailbreak prevention. Those changes should in theory build up over time in their fine-tune dataset, but each model will have its own flaws that need tuning out.

I do also suspect there’s a bit of mental adjustment that goes in too.


I'm pretty sure this isn't happening with the API versions as much as with the "pro plan" (loss leader priced) routers. I imagine that there are others like me working on hard problems for long periods with the model setting pegged to high. Why wouldn't the companies throttle us?

It could even just be that they just apply simple rate limits and that this degrades the effectiveness of the feedback loop between the person and the model. If I have to wait 20 minutes for GPT-5.1-codex-max medium to look at `git diff` and give a paltry and inaccurate summary (yes this is where things are at for me right now, all this week) it's not going to be productive.


I run the same config but it tends to fly through those commands on the weekends, very noticeable difference. I wouldn’t be surprised that the subscription users have a (much) lower priority.

That said I don’t go beyond 70% of my weekly limit so there’s that.


Or, 2b: the nerf is real, but benchmarks are gamed and models are trained to excel at them, yet fall flat in real world situations.


I mostly stay out of the LLM space but I thought it was an open secret already that the benchmarks are absolutely gamed.


As a personal anecdote, I had a fairly involved application that built up a context with a lot of custom prompting and created a ~1000 word output. I could run my application over and over again to inspect the results. It was fairly reproducible.

I was having really nice results with the o4-mini model with high thinking. A little while after GPT-5 came out I revisited my application and tried to continue. The o4-mini results were unusable, while the GPT-5 results were similar to what I had before. I'm not sure what happened to the model in those ~4-5 months I set it down, but there was real degradation.


Is there a reason not to think that, when "refining" the models, they're using the benchmarks as the measure, so the benchmarks show no fidelity loss while performance gets worse in some unbenchmarked ways? "Once a measure becomes a target, it's no longer a useful measure."

That's case #2 for you but I think the explanation I've proposed is pretty likely.


The only time I've seen benchmark-visible nerfing was a drop in performance between the 2.5 March preview and the release.


They are nerfed, and there is actually a very simple test to prove otherwise: temperature 0. This is only available via the API, where you are billed full token prices.

Conclusion: It is nerfed unless Claude can prove otherwise.
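
If someone wanted to actually run that test, a minimal sketch looks like the following (assuming an OpenAI-compatible endpoint; the model name is a placeholder, and note that even at temperature 0 providers don't guarantee bit-identical outputs, so this flags drift rather than proving a nerf):

  # Temperature-0 drift check against an OpenAI-compatible API. Save the hash,
  # re-run weekly, and diff. A changed hash is a signal to investigate, not proof.
  import hashlib
  from openai import OpenAI

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  PROMPT = "Implement a function that parses an ISO 8601 date without external libraries."

  def sample(model: str) -> str:
      resp = client.chat.completions.create(
          model=model,
          temperature=0,
          messages=[{"role": "user", "content": PROMPT}],
      )
      return resp.choices[0].message.content

  if __name__ == "__main__":
      text = sample("gpt-4o-mini")  # placeholder model name
      print(hashlib.sha256(text.encode()).hexdigest())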


I don’t understand how you get from the first paragraph to the conclusion.


> 1. The nerf is psychological, not actual. 2. The nerf is real, but in a way that is perceptible to humans and not to benchmarks.

They could publish weekly benchmarks to disprove this. They almost certainly have internal benchmarking.

The shift is certainly real. It might not be model performance but contextual changes or token performance (tasks take longer even if the model stays the same).


Anyone can publish weekly benchmarks. If you think Anthropic is lying about not nerfing their models, you shouldn't trust benchmarks they release anyway.


I never said they were lying. They haven’t stated that they do not tweak compute, and we know the app is updated regularly.


Moving to new hardware + caching + optimizations might actually change the output slightly; it'll still pass evals all the same, but on the edges it just "feels weird", and that's what makes it feel like it's nerfed.


> The nerf is psychologial, not actual

Once I tested this: I gave the same task to a model right after its release and again a couple of weeks later. On the first attempt it produced well-written code that worked beautifully; I started to worry about the jobs of software engineers. The second attempt was a nightmare, like a butcher acting as a junior developer performing surgery on a horse.

Is this empirical evidence?

And this is not only my experience.

Calling this psychological is gaslighting.


> Is this empirical evidence?

Look, I'm not defending the big labs, I think they're terrible in a lot of ways. And I'm actually suspending judgement on whether there is ~some kind of nerf happening.

But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.


It's not non-empirical. He was careful to give it the same task twice. The dependent variable is his judgment, sure, but why shouldn't we trust that if he's an experienced SWE?


Sample size is way too small.

Unless he was able to sample with temperature 0 (and get fully deterministic results both times), this can just be random chance. And experience as SWE doesn't imply experience with statistics and experiment design.


> But the anecdote you're describing is the definition of non-empirical. It is entirely subjective, based entirely on your experience and personal assessment.

Well, if we see it this way, this is true for Anthropic’s benchmarks as well.

Btw the definition of empirical is: “based on observation or experience rather than theory or pure logic”

So what I described is the exact definition of empirical.


No, it's entirely psychological.

Users are not reliable model evaluators. It's a lesson the industry will, I'm afraid, have to learn and relearn over and over again.


I don't really find this a helpful line to traverse. By this line of inquiry most of the things in software are psychological.

Whether something is a bug or feature.

Whether the right thing was built.

Whether the thing is behaving correctly in general.

Whether it's better at the very moment that the thing occasionally works for a whole range of stuff or that it works perfectly for a small subset.

Whether fast results are more important than absolutely correct results for a given context.

Yes, all things above are also related with each other.

The most we have for LLMs is tallying up each user's experience using an LLM for a period of time across a wide range of "compelling" use cases (the pairing of their prompts and results is empirical though, right?).

This should be no surprise, as humans often can't agree on an end-all-be-all intelligence test for humans either.


No. I'm saying that if you take the same exact LLM on the same exact set of hardware and serve it to the same exact humans, a sizeable amount of them will still complain about "model nerfs".

Why? Because humans suck.


The same prompt producing totally different results is not user evaluation, nor is it psychological. As a developer, you cannot tell the customer you are working for: hey, the first time it did what you asked, the second time it ruined everything, but look, here is the benchmark from Anthropic, and according to this there is nothing wrong.

The only thing that matters and that can evaluate performance is the end result.

But hey, the solution is easy: Anthropic can release their own benchmarks, so everyone can test their models any time. Why don’t they do it?


The models are non-deterministic. You can't just assume, because it did better once before, that it was on average better back then. And the variance is quite large.


No one talked about determinism. The first time it was able to do the task; the second time it was not. It’s not that the implementation details changed.


This isn’t how you should be benchmarking models. You should give it the same task n times and see how often it succeeds and/or how long it takes to be successful (see also the 50% time horizon metric by METR).
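
Concretely, something like this sketch, where `run_task` and `passes_tests` are placeholders for a real harness calling the model and grading its output:

  # Judge a model by success rate over n runs of the same task, not by one run.
  # `run_task` and `passes_tests` are placeholders for a real harness.
  import random

  def run_task(seed: int) -> str:
      random.seed(seed)          # stand-in for calling the model/agent
      return "ok" if random.random() < 0.6 else "fail"

  def passes_tests(output: str) -> bool:
      return output == "ok"      # stand-in for running the test suite / grader

  def pass_rate(n: int = 20) -> float:
      return sum(passes_tests(run_task(i)) for i in range(n)) / n

  if __name__ == "__main__":
      print(f"pass rate over 20 runs: {pass_rate():.0%}")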


I did not say that I ran the prompt only once per attempt. When I say that the second time it failed, I mean that I spent hours restarting, clearing context, giving hints, doing everything to help the model produce something that works.


You aren't really speaking to others' points. Get a friend of yours to read what you are saying; it doesn't sound scientific in the slightest.


I never claimed this was a scientific study. It was an observation repeated over time. That is empirical in the plain meaning of the word.

Criticizing it for “not being scientific” is irrelevant; I didn’t present it as science. Are people only allowed to share experiences here if they come wrapped in a peer-reviewed paper?

If you want to debate the substance of the observation, happy to. But don’t rewrite what I said into a claim I never made.


I was pretty disappointed to learn that the METR metric isn't actually evaluating a model's ability to complete long duration tasks. They're using the estimated time a human would take on a given task. But it did explain my increasing bafflement at how the METR line keeps steadily going up despite my personal experience coding daily with LLMs where they still frequently struggle to work independently for 10 minutes without veering off task after hitting a minor roadblock.

  On a diverse set of multi-step software and reasoning tasks, we record the time needed to complete the task for humans with appropriate expertise. We find that the time taken by human experts is strongly predictive of model success on a given task: current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours. This allows us to characterize the abilities of a given model by “the length (for humans) of tasks that the model can successfully complete with x% probability”.

  For each model, we can fit a logistic curve to predict model success probability using human task length. After fixing a success probability, we can then convert each model’s predicted success curve into a time duration, by looking at the length of task where the predicted success curve intersects with that probability.
[1] https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...
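
For anyone curious what that fitting step looks like mechanically, here is a toy version (the data is made up for illustration, not METR's; requires numpy and scikit-learn):

  # Toy version of the METR "time horizon" fit: logistic regression of task
  # success against log(human task length), then solve for the 50% crossing.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
  succeeded     = np.array([1, 1, 1, 1, 1,  0,  1,  0,   0,   0])  # invented labels

  X = np.log(human_minutes).reshape(-1, 1)
  clf = LogisticRegression().fit(X, succeeded)

  w, b = clf.coef_[0][0], clf.intercept_[0]
  horizon = np.exp(-b / w)   # p = 0.5 where w * log(t) + b = 0
  print(f"50% time horizon ~= {horizon:.0f} human-minutes")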


It makes perfect sense to use human times as a baseline. Because otherwise, the test would be biased towards models with slower inference.

If model A generates 10 tokens a second and model B generates 100 tokens a second, then using real LLM inference time puts A at a massive 10x advantage, all other things equal.


But it doesn't evaluate the area that I am most eager to see improvements in LLM agent performance: unattended complex tasks that require adapting to unexpected challenges, problem solving and ambiguity for a long duration without a human steering them back in the right direction before they hit a wall or start causing damage.

If it takes me 8 hours to create a pleasant looking to-do app, and Gemini 3 can one shot that in 5 minutes, that's certainly impressive but doesn't help me evaluate whether I could drop an agent in my complex, messy project and expect it to successfully implement a large feature that may require reading docs, installing a new NPM package, troubleshooting DB configuration, etc for 30 min to 1 hr without going off the rails.

It's a legitimate benchmark, I'm not disputing that, but it unfortunately isn't measuring the area that could be a significant productivity multiplier in my day-to-day work. The METR time horizon score is still susceptible to the same pernicious benchmaxxing while I had previously hoped that it was measuring something much closer to my real world usage of LLM agents.

Improvements in long duration, multi-turn unattended development would save me lot of babysitting and frustrating back and forth with Claude Code/Codex. Which currently saps some of the enjoyment out of agentic development for me and requires tedious upfront work setting up effective rules and guardrails to work around those deficits.


There are many, many tasks that a given LLM can successfully do 5% of the time.

Feeling lucky?
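
For scale (simple arithmetic, not from the thread), at a 5% per-attempt success rate you need roughly 14 independent tries before you're more likely than not to get one success:

  # At a 5% per-try success rate, odds of at least one success after k tries.
  p = 0.05
  for k in (1, 5, 14, 45, 90):
      print(f"{k:3d} tries -> {1 - (1 - p) ** k:.0%} chance of at least one success")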


I'm working on a hard problem recently and have been keeping my "model" setting pegged to "high".

Why in the world, if I'm paying the loss leader price for "unlimited" usage of these models, would any of these companies literally respect my preference to have unfettered access to the most expensive inference?

Especially when one of the hallmark features of GPT-5 was a fancy router system that decides automatically when to use more/less inference resources, I'm very wary of those `/model` settings.


Because intentionally fucking over their customers would be an impossible secret to keep, and when it inevitably leaks would trigger severe backlash, if not investigations for fraud. The game theoretic model you’re positing only really makes sense if there’s only one iteration of the game, which isn’t the case.


That is unfortunately not true. It's pretty easy to mess with your customers when your whole product is as opaque as LLMs. I mean they don't even understand how they work internally.


https://en.wikipedia.org/wiki/Regression_toward_the_mean

The way this works is:

1) x% of users have an exceptional first experience by chance. Nobody who has a meh first experience bothers to try a second time.
2) x²% of users also have an exceptional second experience by chance.
3) So a lot of people with a great first experience think the model started off great and then suddenly got worse.

Suppose it's 25% that have a really great first experience. 25% of them have a great second experience too, but 75% of them see a sudden decline in quality and decide that it must be intentional. After the third experience this population gets bigger again.

So by pure chance and sampling biases you end up convincing a bunch of people that the model used to be great but has gotten worse, but a much smaller population of people who thought it was terrible but got better because most of them gave up early.

This is not in their heads: they really did see declining success. But they experienced it without any changes to the model at all.
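
You can watch this happen in a quick simulation (using the 25% figure assumed above; nothing about the "model" changes anywhere in the loop):

  # Simulate the sampling-bias story: every session is independently "great"
  # with probability 0.25, yet most users whose first session was great will
  # see a "decline", with no change to the model at all.
  import random

  random.seed(0)
  P_GREAT, USERS = 0.25, 100_000
  declined = stayed_great = 0

  for _ in range(USERS):
      if random.random() >= P_GREAT:
          continue                     # meh first try; many never come back
      if random.random() < P_GREAT:
          stayed_great += 1
      else:
          declined += 1                # "it was great last week, they nerfed it!"

  print(f"{declined / (declined + stayed_great):.0%} of hooked users saw a 'decline'")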


Your theory does not hold if a user initially had a great experience for weeks and then had a bad experience, also for weeks.


If by "second" and "third" experience you mean "after 2 ~ 4 weeks of all-day usage"


I think this is pretty easy to explain psychologically.

The first time you see a dog that can make pancakes, you’re really focused on the fact that a dog is making pancakes.

After a few weeks of having them for breakfast, you start to notice that the pancakes are actually kind of overcooked and don’t taste that good. Sure it’s impressive that a dog made them, but what use are sub-par pancakes? You’re naturally more focused on what it can’t do than what it can.


I'm not doubting you, but share the chats! It would make your point even stronger.


Other comments in this thread do a good job explaining the differences between the Markov algorithm and the transformer architecture that LLMs use.

I think it's worth mentioning that you have indeed identified a similarity, in that both LLMs and Markov chain generators have the same algorithm structure: autoregressive next-token generation.

Understanding Markov chain generators is actually a really, really good step towards understanding how LLMs work overall, and I think it's a really good pedagogical tool.

Once you understand Markov generation, doing a bit of handwaving to say "and LLMs are just like this, except with a more sophisticated statistical approach" has the benefit of being true, demystifying LLMs, and also preserving a healthy respect for just how powerful that statistical model can be.
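
If you've never built one, a word-level Markov generator is only a few lines (a toy sketch): the sample-next-token loop has the same shape as an LLM's decoding loop, except the "model" here is just a table of bigram counts instead of a learned transformer.

  # Toy word-level Markov chain generator: same autoregressive next-token loop
  # as an LLM, but the "model" is just a table of observed bigram continuations.
  import random
  from collections import defaultdict

  def train(text):
      words = text.split()
      table = defaultdict(list)
      for prev, nxt in zip(words, words[1:]):
          table[prev].append(nxt)       # remember every continuation we saw
      return table

  def generate(table, start, length=20):
      out = [start]
      for _ in range(length):
          choices = table.get(out[-1])
          if not choices:
              break
          out.append(random.choice(choices))  # sample next word given only the previous word
      return " ".join(out)

  corpus = "the cat sat on the mat and the dog sat on the rug"
  print(generate(train(corpus), start="the"))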


So this "even-handeness" metric is a pretty explicit attempt to aim for the middle on everything, regardless of where the endpoints are.

This is well-suited to Anthropic's business goals (alienating as few customers as possible). But it entirely gives up on the notion of truth or factual accuracy in favor of inoffensiveness.

Did Tiananmen square happen? Sure, but it wasn't as bad as described. Was the holocaust real? Yes, lots of people say it was, but a lot of others claim it was overblown (and maybe even those who thought the Jews had it coming actually had a valid complaint.) Was Jan 6 an attempt to overthrow the election? Opinions differ! Should US policy be to "deport" immigrants with valid visas who are thinly accused of crimes, without any judicial process or conviction? Who, really, is to say whether this is a good thing or a bad thing.

Aside from ethical issues, this also leaves the door wide open to Overton-hacking and incentivizes parties to put their most extreme arguments forward, just to shift the middle.

Our society does NOT need more of that.


Was Jamal Khashoggi accidentally butchered like an animal in a secure government building? Maybe!


> "it could very well be that the Crown Prince had knowledge of this tragic event – maybe he did and maybe he didn't"


The fallacy of the middle is a poison that extremists with power and media reach use to kill productive discourse.

People who don't care about the distinction between truth and falsehood understand this very well, and use it to its full potential. After all, the half-way point between truth and a wild, brazen, self-serving lie is... A self-serving lie.

The media has been largely complicit in this (Because controversy sells), but now we're getting this crap cemented in AI models. Wonderful.

---

The promise that hackers are making is that these systems will enhance our knowledge and understanding. The reality they have delivered is a bullshit generator which serves its operators.


The middle is not a fallacy. There is more than a binary choice most of the time and most of politics is subjective. The media is largely complicit in selling the lie that there are only two flavours of ice cream available at any given time.


[flagged]


So a neat thing about truth is that these questions actually have answers! I encourage you to research them, if you're curious. We really don't need to live in this world of both-sides-ism.

(Also, I'm a bit bemused that these are the examples you chose... with everything going on in the world, what's got you upset is a possibly dubious investigation of your guy which never even came to anything...?)


[flagged]


People believe incorrect things all the time, for a variety of reasons. It doesn't mean the truth doesn't exist. Sure, sometimes, there isn't sufficient evidence to reasonably take a side.

But lots of times there is. For example, just because a lot of people now believe Tylenol causes autism doesn't mean we need to both-sides it... the science is pretty clear that it doesn't.

Lots of people can be wrong on this topic, and it should be ok to say that they're wrong. Whether you're an individual, a newspaper, an encyclopedia, or a LLM.


Not everybody is going to agree; heck, even Nixon had something like 24% support when he was proven guilty of orchestrating Watergate and taping the whole thing. The benchmark isn't every human agreeing, it's just finding out what's true, and a lot of the time the facts are actually pretty compelling.


Was the name Arctic Frost chosen specifically to bring attention to the Ice Wall, which tells us the true nature of the "planet" Earth?


Or else it trained/overfit to the benchmarks. We won't really know until people have a chance to use it for real-world tasks.

Also, models are already pretty good but product/market fit (in terms of demonstrated economic value delivered) remains elusive outside of a couple domains. Does a model that's (say) 30% better reach an inflection point that changes that narrative, or is a more qualitative change required?


Quite the opposite. The increased fear is that there will be bad actors (brownshirts, racists, klansmen, etc.) that the police are not making an effort to restrain, or even with whom the police are allied.

Your average liberal/progressive is still probably less afraid (relative to the median) about random or property crime.


> Your average liberal/progressive is still probably less afraid (relative to the median) about random or property crime.

But random crime is much more likely to affect you than brownshirts or klansmen, so that seems irrational.


Depends on locale. In Chicago, or now Charlotte, brownshirt encounters are highly elevated.


The "aha" moment is also a cognitive risk, since it's often the moment we stop looking for more answers.

This is the premise of a really good article I recommend to anyone: "The Seductions of Clarity" by C. Thi Nguyen (https://philarchive.org/rec/NGUTSO-2)


Which is of course used as a trap by charlatans who want to trigger that "aha" while hiding that they are the ones manipulating you. In fact, the whole point of the humanities is learning to build a shield, and your own rhetorical sword, against bullshit.


This happens to people when they’re stoned or delusional. They have a false "aha" moment and a false, but deep, certainty of their own brilliance.

It is quite literally the source of one of our most dangerous failure modes.


Similar to this is the thought-terminating cliché:

https://en.wikipedia.org/wiki/Thought-terminating_cliché


I'm convincing myself that the root of all evils^H^H^H^H^Hpoor thinking and argumentation generally arises from not thinking further and more critically. Those clichés are just some nicely packaged, ready-made products to induce this.

