This is classic Goodhart's law. "When a measure becomes a target, it ceases to b...

ambicapter · 2025-08-12T13:21:46 1755004906

It's really not that hard to skip building a custom bench setup to game the benchmark and just use your product straight out of the box, though.

VikingCoder · 2025-08-12T15:08:35 1755011315

Right, other than financial pressure. Which is, of course, immense.

jasonjmcghee · 2025-08-12T14:40:58 1755009658

Right. Building a custom setup is blatant; that will wildly overfit.

But let's say a group uses it as a metric in CI, and each new idea or feature they create runs against SWE-bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.
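
To make the checkpoint-selection point concrete, here is a rough toy simulation, not a model of SWE-bench itself. Every number and name in it is made up for illustration: it assumes all candidates have exactly the same true solve rate and that each task is an independent coin flip. Even so, picking the best benchmark score among many candidates reports a number above what the chosen candidate gets on fresh tasks.

```python
import random


def simulate(num_candidates=20, num_tasks=500, true_skill=0.30,
             trials=200, seed=0):
    """Average benchmark score of the 'winning' candidate vs. its fresh-task score.

    Hypothetical setup: every candidate solves each task with the same
    probability `true_skill`, so any difference in scores is pure noise.
    """
    rng = random.Random(seed)

    def evaluate():
        # Fraction of tasks solved in one noisy evaluation run.
        solved = sum(rng.random() < true_skill for _ in range(num_tasks))
        return solved / num_tasks

    bench_total = 0.0
    fresh_total = 0.0
    for _ in range(trials):
        # Keep whichever candidate scored highest on the benchmark.
        bench_total += max(evaluate() for _ in range(num_candidates))
        # The winner is no better than the rest, so re-evaluating it on
        # fresh tasks is just another draw at the true solve rate.
        fresh_total += evaluate()

    return bench_total / trials, fresh_total / trials


if __name__ == "__main__":
    bench, fresh = simulate()
    print(f"selected-on-benchmark score: {bench:.3f}")
    print(f"same candidate, fresh tasks: {fresh:.3f}")
```

The gap here comes only from selecting on a noisy number. In a real pipeline some candidates are genuinely better, so part of the gain is real and part is this selection effect, which is why it might or might not pay off.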

clutchdude · 2025-08-12T13:26:09 1755005169

Also see VW's Dieselgate and numerous other "gaming the system" examples.

kelipso · 2025-08-12T20:22:15 1755030135

A specific setup for the benchmark is just plain cheating, not Goodhart’s law.