Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Ideally this should be done blind (does BetterHelp even allow therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for an average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are:
1) prove that it is possible to simulate patients with an LLM. Which we did.
2) prove that an LLM-as-a-Judge can effectively score conversations along several dimensions, similar to how clinicians are also evaluated. Which we also did: we show that the average correlation with human evaluators is medium-high.
Given 1) and 2) we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
The same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists, just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower as conversations get longer, it's reasonable to expect it would be worse than a human (who typically doesn't become increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby who is available and whom they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That Reddit thread ranked highly in my Google search about BetterHelp being bad, so they're probably trying to piggyback on it.
I'm not against affiliate links; I'm just pro-disclosure, especially for something as important as therapy, and it seems like maybe you should mention you make $150 for each person who signs up.