Right. Building a custom setup is blatant; that will wildly overfit.
But let's say a group uses it as a metric in CI, and each new idea or feature they create gets run against SWE-bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine-tuning, maybe they're choosing between checkpoints.
This will also end up overfitting, especially if done habitually: every choice made by looking at the score selects a little for benchmark-specific noise. It might still be a great metric and result in a more powerful overall model. Or it might not.
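To make the checkpoint-picking case concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the candidates all have exactly the same true pass rate, tasks are independent coin flips, and the numbers (30% pass rate, 500 tasks) are made up rather than taken from any real SWE-bench run. It only shows that picking the best of several equally good candidates reports a score above their true skill, and that the gap tends to grow with the number of candidates tried.

```python
import random


def benchmark_score(true_pass_rate, n_tasks, rng):
    """Fraction of n_tasks solved, each solved independently with true_pass_rate."""
    solved = sum(rng.random() < true_pass_rate for _ in range(n_tasks))
    return solved / n_tasks


def selection_bias(n_candidates, true_pass_rate=0.30, n_tasks=500,
                   n_trials=100, seed=0):
    """Average gap between the best observed score and the true pass rate.

    All candidates are assumed to be equally good (hypothetical setup), so any
    positive gap comes purely from selecting on a noisy benchmark.
    """
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(n_trials):
        best = max(
            benchmark_score(true_pass_rate, n_tasks, rng)
            for _ in range(n_candidates)
        )
        total_gap += best - true_pass_rate
    return total_gap / n_trials


if __name__ == "__main__":
    # Compare "evaluate once" against "pick the best of k candidates".
    for k in (1, 5, 20, 50):
        print(k, round(selection_bias(k), 4))
```

With one candidate the average gap should hover around zero; with more candidates the "winner" looks better than it really is, even though nothing actually improved. Real checkpoints are correlated and do differ in skill, so treat this as the direction of the effect, not its size.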
https://en.wikipedia.org/wiki/Goodhart%27s_law