
Hoping someone here may know the answer to this: do any of the current benchmarks account for false answers in any meaningful way, beyond how a typical test would (i.e., giving any answer at all beats saying "I don't know", since a guess at least has a chance of being correct, which in the real world is bad)? I want an LLM that tells me when it doesn't know something. If it gives me an accurate response 90% of the time and an inaccurate one 10% of the time, it is less useful to me than one that gives me an accurate answer 10% of the time and says "I don't know" the other 90%.
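
One rough way to make that preference concrete (assumed numbers, not anything an existing benchmark does): score +1 for a right answer, 0 for "I don't know", and subtract some penalty for each wrong answer.

    # Assumed scoring rule (illustration only): +1 for a right answer,
    # 0 for "I don't know", minus `wrong_cost` for each wrong answer.
    def expected_score(right, idk, wrong, wrong_cost=10.0):
        return right * 1.0 + idk * 0.0 - wrong * wrong_cost

    # The two hypothetical models above:
    print(expected_score(0.90, 0.00, 0.10))  # -0.10 at wrong_cost=10
    print(expected_score(0.10, 0.90, 0.00))  #  0.10

With that assumed penalty the mostly-"I don't know" model scores higher; the break-even is a penalty of 8x a right answer's value, so below that the 90%-right model wins. How costly a wrong answer is turns out to be the whole judgment call.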



Those numbers are too good to expect. If 90% right / 10% wrong is the baseline, would you take any of these as an improvement (right / "I don't know" / wrong):

- 80% right / 18% "I don't know" / 2% wrong
- 50% / 48% / 2%
- 10% / 90% / 0%
- 80% / 15% / 5%

The general point is that, if the only lever available is a trade-off, reducing wrong answers means accepting some reduction in right answers. Otherwise you're just saying "I'd like a better system", which is rather obvious.

Personally I'd take something like 70/27/3, presuming the 70% of right answers aren't all on the trivial questions.
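
For illustration, here is how those options rank under the same kind of assumed penalty scoring (the 3x cost per wrong answer is made up, not from any benchmark):

    # Assumed idea: +1 right, 0 "I don't know", -wrong_cost per wrong answer.
    # The 3x penalty is purely for illustration.
    def expected_score(right, idk, wrong, wrong_cost=3.0):
        return right - wrong * wrong_cost

    options = {
        "baseline 90/0/10": (0.90, 0.00, 0.10),
        "80/18/2": (0.80, 0.18, 0.02),
        "50/48/2": (0.50, 0.48, 0.02),
        "10/90/0": (0.10, 0.90, 0.00),
        "80/15/5": (0.80, 0.15, 0.05),
        "70/27/3": (0.70, 0.27, 0.03),
    }
    for name, (r, i, w) in options.items():
        print(name, round(expected_score(r, i, w), 2))

At that made-up cost, 80/18/2 comes out on top, 80/15/5 and 70/27/3 also beat the baseline, and 10/90/0 only pulls ahead once a wrong answer costs more than about 8x a right one.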


I think you may have misread. They stated that they'd be willing to go from 90% correct to 10% correct for this tradeoff.


Thanks for the correction


OpenAI uses SimpleQA to assess hallucinations




