
This is classic Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

https://en.wikipedia.org/wiki/Goodhart%27s_law



It's really not that hard to just run your product straight out of the box instead of building a custom bench setup to game the benchmark, though.


Right, other than financial pressure. Which is, of course, immense.


Right. Building a custom setup is blatant; that will wildly overfit.

But let's say a group uses it as a metric in CI, and each new idea or feature they create gets run against SWE-bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.
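To make the checkpoint-selection point concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the candidate count, task count, and pass rate are made-up numbers, not anything from SWE-bench or a real lab. Every candidate has the same true pass rate, so any gap between the "picked the best one" score and a fresh run is pure selection noise.

```python
import random

# Hypothetical numbers, chosen only for illustration.
NUM_CANDIDATES = 20     # checkpoints / configs tried against the benchmark
NUM_TASKS = 200         # tasks in the benchmark
TRUE_PASS_RATE = 0.30   # every candidate is equally good in reality
NUM_TRIALS = 100        # repeat the whole experiment to average out luck


def run_benchmark(rng, pass_rate, num_tasks):
    """Return the fraction of tasks passed in one noisy benchmark run."""
    passed = sum(rng.random() < pass_rate for _ in range(num_tasks))
    return passed / num_tasks


def simulate(seed=0):
    """Average the selected (best-of-N) score and a fresh re-run score."""
    rng = random.Random(seed)
    selected_total = 0.0
    fresh_total = 0.0
    for _ in range(NUM_TRIALS):
        # Score every candidate on the benchmark and keep the top score,
        # as a CI loop that picks the best checkpoint would.
        scores = [
            run_benchmark(rng, TRUE_PASS_RATE, NUM_TASKS)
            for _ in range(NUM_CANDIDATES)
        ]
        selected_total += max(scores)
        # Re-evaluate the winner on fresh tasks. Its true pass rate is
        # unchanged, so this is just another independent draw.
        fresh_total += run_benchmark(rng, TRUE_PASS_RATE, NUM_TASKS)
    return selected_total / NUM_TRIALS, fresh_total / NUM_TRIALS


mean_selected, mean_fresh = simulate()
print(f"reported (best of {NUM_CANDIDATES}): {mean_selected:.3f}")
print(f"fresh re-evaluation:    {mean_fresh:.3f}")
```

The reported score comes out above the fresh one even though no candidate is actually better than any other. Real candidates do differ, so in practice the selected score mixes genuine improvement with this kind of inflation, and the benchmark alone can't tell you the split.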


See also VW's Dieselgate and numerous other "gaming the system" examples.


A specific setup for the benchmark is just plain cheating, not Goodhart’s law.



