
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test would (i.e., giving any answer at all beats saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful to me than one that gives me an accurate answer 10% of the time and says "I don't know" the other 90%.
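
One rough way to make that preference concrete (assumed numbers, not anything an existing benchmark does): score +1 for a right answer, 0 for "I don't know", and subtract some penalty for each wrong answer.

    # Assumed scoring rule (illustration only): +1 for a right answer,
    # 0 for "I don't know", minus `wrong_cost` for each wrong answer.
    def expected_score(right, idk, wrong, wrong_cost=10.0):
        return right * 1.0 + idk * 0.0 - wrong * wrong_cost

    # The two hypothetical models above:
    print(expected_score(0.90, 0.00, 0.10))  # -0.10 at wrong_cost=10
    print(expected_score(0.10, 0.90, 0.00))  #  0.10

With that assumed penalty the mostly-"I don't know" model scores higher; the break-even is a penalty of 8x a right answer's value, so below that the 90%-right model wins. How costly a wrong answer is turns out to be the whole judgment call.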



Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take any of these as an improvement (right / "I don't know" / wrong):

- 80% right / 18% "I don't know" / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%

The general point is that, if the only lever available is a trade-off, reducing wrong answers means accepting some reduction in right answers. Otherwise you're just saying "I'd like a better system", which is rather obvious.

Personally I'd take something like 70/27/3, presuming the 70% of right answers aren't all on the trivial questions.
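
For illustration, here is how those options rank under the same kind of assumed penalty scoring (the 3x cost per wrong answer is made up, not from any benchmark):

    # Assumed idea: +1 right, 0 "I don't know", -wrong_cost per wrong answer.
    # The 3x penalty is purely for illustration.
    def expected_score(right, idk, wrong, wrong_cost=3.0):
        return right - wrong * wrong_cost

    options = {
        "baseline 90/0/10": (0.90, 0.00, 0.10),
        "80/18/2": (0.80, 0.18, 0.02),
        "50/48/2": (0.50, 0.48, 0.02),
        "10/90/0": (0.10, 0.90, 0.00),
        "80/15/5": (0.80, 0.15, 0.05),
        "70/27/3": (0.70, 0.27, 0.03),
    }
    for name, (r, i, w) in options.items():
        print(name, round(expected_score(r, i, w), 2))

At that made-up cost, 80/18/2 comes out on top, 80/15/5 and 70/27/3 also beat the baseline, and 10/90/0 only pulls ahead once a wrong answer costs more than about 8x a right one.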


I think you may have misread. They stated that they'd be willing to go from 90% correct to 10% correct for this tradeoff.


Thanks for the correction


OpenAI uses SimpleQA to assess hallucinations




