Did you use the same prompts for all the models, or individualized prompts per model? And if you used more than a base prompt, did you try a range of prompts that were very different from each other?
I'm sure it's in the details somewhere, but after a quick skim I didn't find anything outlining how you managed and used the prompts, or whether they were per model.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
The prompts are kept the same for all models; otherwise the comparison would not be fair. In any case, you can check all the prompts in our GitHub repo.
Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have models X and Y, and prompts A and B: X runs once with A and once with B, and the same for Y (sketch at the end of this comment).
The reason I ask is that in my own local benchmarks, which I run against my own tasks for each model release, I've noticed huge variance in response quality based on the prompts themselves. Slight variations in wording seem to have a big effect on the final responses, and those variations in turn seem to affect each model very differently.
Sometimes a huge system prompt makes one model return much higher-quality responses, while another model gives much higher-quality responses when the system prompt is as small as it can possibly be. At least that's what I'm seeing with the local models I'm putting under test in my private benchmarks.
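To make the cross-product concrete, here's a minimal sketch of what I mean. The model names, prompt texts, and the run_model() stub are all made up for illustration, not taken from your repo:

    # Hypothetical sketch of a full models-by-prompts cross-product.
    # Names and the run_model() stub are placeholders, not from the actual repo.

    def run_model(model: str, prompt: str) -> str:
        # Stand-in for a real API/inference call.
        return f"response from {model} to: {prompt}"

    MODELS = ["model_x", "model_y"]
    PROMPTS = {"A": "Solve the task step by step.", "B": "Solve the task; be brief."}

    # Every model sees every prompt, so no model is favoured or penalized
    # by a single prompt that happens to suit it better than the others.
    results = {
        (model, prompt_id): run_model(model, prompt)
        for model in MODELS
        for prompt_id, prompt in PROMPTS.items()
    }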
Did you re-test the past models with the new prompt you found? How many times did you run each prompt? Did you use the same rubric to score each experiment?
> Did you re-test the past models with the new prompt you found?
Yeah, initially I wrote this test/benchmark harness because I wanted to compare multiple different prompts for the same tasks and the same model, but it eventually grew from there. It still has the prompts at its core, and I re-run everything whenever something changes or I add new models.
> How many times did you run each prompt?
It's structured as Category > Task > Case, combined with a list of Prompts for each Task; each Case runs with each of the Prompts. So I guess you could say each prompt gets "exercised" as many times as there are Cases in its Task.
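Roughly, as a sketch (the field names here are mine for illustration, not the harness's actual schema):

    # Rough sketch of the Category > Task > Case structure described above.
    # Field names are illustrative only, not the harness's real schema.
    from dataclasses import dataclass, field

    @dataclass
    class Case:
        name: str
        expected: str               # what the binary pass/fail check compares against

    @dataclass
    class Task:
        name: str
        prompts: list[str]          # every Case runs once per prompt in this list
        cases: list[Case] = field(default_factory=list)

    @dataclass
    class Category:
        name: str
        tasks: list[Task] = field(default_factory=list)

    def runs_for(task: Task) -> int:
        # Each prompt is "exercised" once per Case in its Task.
        return len(task.prompts) * len(task.cases)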
> Did you use the same rubric to score each experiment?
I'm not sure if you mean something specific by "rubric" (I'm not from academia), but they're all pretty much binary "passed" or "not passed". The coding ones are backed by unit tests that were initially failing and must pass afterwards without being changed, the translation ones by (mostly) simple string checks, and so on. I don't have any tasks or cases that are "rate this solution from 0-10" or similar.
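For example, the two kinds of checks look roughly like this (made-up helpers to illustrate the idea, not the harness's real code):

    # Made-up illustration of the binary "passed" / "not passed" scoring.
    import subprocess

    def score_coding_case(repo_dir: str) -> bool:
        # Passes if the previously failing test suite now exits cleanly;
        # the real harness would also verify the tests themselves weren't edited.
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
        return result.returncode == 0

    def score_translation_case(model_output: str, expected_phrase: str) -> bool:
        # Passes on a simple case-insensitive string check.
        return expected_phrase.lower() in model_output.lower()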