Did you use the same prompts for all the models, or individualized prompts per model? And if you used more than a base prompt, did you try a range of prompts that were very different from each other?
I'm sure it's in the details somewhere, but after a quick skim I didn't find anything outlining how you managed and used the prompts, or whether they were per model.
Thanks a bunch for being open to answering questions here, and thanks for trying to attack this particular problem with scientific rigor, even if it's really difficult to do so.
The prompts are kept the same for all models; otherwise the comparison would not be fair. In any case, you can check all the prompts in our GitHub repo.
Wouldn't that be easy to make fair by making sure all models tried it with the same prompts? So you have models X and Y, and prompts A and B: X runs once with A and once with B, and the same for Y (sketch at the end of this comment).
The reason I ask is that in my own local benchmarks, which I run against my own tasks for each model release, I've noticed huge variance in response quality based on the prompts themselves. Slight variations in wording seem to have a big effect on the final responses, and those variations in turn seem to affect each model very differently.
Sometimes a huge system prompt makes one model return much higher-quality responses, while another model gives much higher-quality responses when the system prompt is as small as it can possibly be. At least that's what I'm seeing with the local models I'm putting under test in my private benchmarks.
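To make the cross-product concrete, here's a minimal sketch of what I mean. The model names, prompt texts, and the run_model() stub are all made up for illustration, not taken from your repo:

    # Hypothetical sketch of a full models-by-prompts cross-product.
    # Names and the run_model() stub are placeholders, not from the actual repo.

    def run_model(model: str, prompt: str) -> str:
        # Stand-in for a real API/inference call.
        return f"response from {model} to: {prompt}"

    MODELS = ["model_x", "model_y"]
    PROMPTS = {"A": "Solve the task step by step.", "B": "Solve the task; be brief."}

    # Every model sees every prompt, so no model is favoured or penalized
    # by a single prompt that happens to suit it better than the others.
    results = {
        (model, prompt_id): run_model(model, prompt)
        for model in MODELS
        for prompt_id, prompt in PROMPTS.items()
    }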
Did you re-test the past models with the new prompt you found? How many times did you run each prompt? Did you use the same rubric to score each experiment?
> Did you re-test the past models with the new prompt you found?
Yeah, initially I wrote this test/benchmark harness because I wanted to compare multiple different prompts for the same tasks and the same model, but it eventually grew from there. It still has the prompts at its core, and I re-run everything whenever something changes or I add new models.
> How many times did you run each prompt?
It's structured as Category > Task > Case, combined with a list of Prompts for each Task; each Case runs with each of the Prompts. So I guess you could say each prompt gets "exercised" as many times as there are Cases in its Task.
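Roughly, as a sketch (the field names here are mine for illustration, not the harness's actual schema):

    # Rough sketch of the Category > Task > Case structure described above.
    # Field names are illustrative only, not the harness's real schema.
    from dataclasses import dataclass, field

    @dataclass
    class Case:
        name: str
        expected: str               # what the binary pass/fail check compares against

    @dataclass
    class Task:
        name: str
        prompts: list[str]          # every Case runs once per prompt in this list
        cases: list[Case] = field(default_factory=list)

    @dataclass
    class Category:
        name: str
        tasks: list[Task] = field(default_factory=list)

    def runs_for(task: Task) -> int:
        # Each prompt is "exercised" once per Case in its Task.
        return len(task.prompts) * len(task.cases)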
> Did you use the same rubric to score each experiment?
I'm not sure if you mean something specific by "rubric" (I'm not from academia), but they're all pretty much binary "passed" or "not passed". The coding ones are backed by unit tests that were initially failing and must pass afterwards without being changed, the translation ones by (mostly) simple string checks, and so on. I don't have any tasks or cases that are "rate this solution from 0-10" or similar.
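For example, the two kinds of checks look roughly like this (made-up helpers to illustrate the idea, not the harness's real code):

    # Made-up illustration of the binary "passed" / "not passed" scoring.
    import subprocess

    def score_coding_case(repo_dir: str) -> bool:
        # Passes if the previously failing test suite now exits cleanly;
        # the real harness would also verify the tests themselves weren't edited.
        result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
        return result.returncode == 0

    def score_translation_case(model_output: str, expected_phrase: str) -> bool:
        # Passes on a simple case-insensitive string check.
        return expected_phrase.lower() in model_output.lower()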