
Thanks for open sourcing this.

I'm skeptical of the value of this benchmark, and I'm curious about your thoughts: self-play / reinforcement learning tasks can be useful in a variety of arenas, but I'm not a priori convinced they're useful when the intent is to help humans in situations where theories of mind matter.

That is, we're using the same underlying model(s) to simulate both a patient and a judgment of how patient-like that patient is. This seems like an area where I'd really want to feel confident that my judge LLM is accurate; otherwise the training data I'm generating risks converging on a theory of mind / set of simulated patients that's completely untethered from, you know, actual patients.

Any thoughts on this? Feels like we want a human in the loop somewhere here, probably scoring the judge LLM's determinations until we're confident the judge is at human level or better. Until then, this risks building up a self-consistent, but ultimately just totally wrong, set of data that will be used in future RL tasks.
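To make the calibration step concrete, here's a rough sketch (Python, with made-up variable names and a placeholder threshold, nothing from the repo): collect a sample of transcripts scored by both the judge LLM and human raters, measure chance-corrected agreement, and only let the judge generate reward signal once agreement clears some bar.

    from collections import Counter

    def cohen_kappa(a, b):
        """Agreement between two raters over the same items, corrected for chance."""
        assert len(a) == len(b)
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
        return (observed - expected) / (1 - expected)

    # Hypothetical patient-likeness ratings on the same transcripts.
    judge_scores = [3, 4, 2, 5, 4]   # judge LLM
    human_scores = [3, 4, 3, 5, 4]   # human raters

    if cohen_kappa(judge_scores, human_scores) < 0.7:  # threshold is a placeholder
        print("Judge not yet trustworthy; keep humans scoring its determinations.")

Until that check passes on held-out transcripts, I'd treat the judge's scores as unvalidated rather than as ground truth for RL.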


