Right, this result seems meaningless without a human clinician control.
I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Ideally this should be done blind (does BetterHelp even allow therapy through a text chat interface?), where the therapist has no idea it's for a study and so isn't trying to "do better" than they would for an average client.
Because while I know a lot of people for whom therapy has been life-changing, I also know of a lot of terrible and even unprofessional therapy experiences.
The results are not meaningless, but they are not comparing humans against LLMs. The goal is to have something that can be used to test LLMs on realistic mental health support.
The main points of our methodology are:
1) prove that it is possible to simulate patients with an LLM. Which we did.
2) prove that an LLM-as-a-Judge can effectively score conversations along several dimensions, similar to how clinicians are also evaluated. Which we also did: we show that the average correlation with human evaluators is medium-high.
Given 1) and 2) we can then benchmark LLMs, and as you see, there is plenty of room for improvement. We did not claim anything regarding human performance... it's likely that human performance also needs to improve :) that's another study
So the results are meaningful in terms of establishing that LLM therapeutic performance can be evaluated.
But not meaningful in terms of comparing LLMs with human clinicians.
So in that case, how can you justify the title you used for submission, "New benchmark shows top LLMs struggle in real mental health care"?
How are they struggling? Struggling relative to what? For all your work shows, couldn't they be outperforming the average human? Or even if they're below that, couldn't they still have a large net positive effect with few negative outcomes?
I don't understand where the negative framing of your title is coming from.
LLMs have room for improvement (we show that their scores are medium-low on several dimensions).
Maybe the average human also has lots of room for improvement. One thing does not necessarily depend on the other.
The same way we can say that LLMs still have room for improvement on a specific task (let's say mathematics) while the average human is also bad at mathematics...
We don't make any claims about human therapists, just that LLMs have room for improvement on several dimensions if we want them to be good at therapy. Showing this is the first step to improving them.
But you chose the word "struggle". And now you say:
> Just that LLMs have room for improvement on several dimensions if we want them to be good at therapy.
That implies they're not currently good at therapy. But you haven't shown that, have you? How are you defining that a score of 4 isn't already "good"? How do you know that isn't already correlated with meaningfully improved outcomes, and therefore already "good"?
Everybody has room for improvement if you say 6 is perfection and something isn't reaching 6 on average. But that doesn't mean everybody's struggling.
I take no issue with your methodology. But your broader framing, and title, don't seem justified or objective.
> Right, this result seems meaningless without a human clinician control.
> I'd very much like to see clinicians randomly selected from BetterHelp and paid to interact the same way with the LLM patient and judged by the LLM, as the current methodology uses. And see what score they get.
Does it really matter? Per the OP:
>>> Across all models, average clinical performance stayed below 4 on a 1–6 scale. Performance degraded further in severe symptom scenarios and in longer conversations (40 turns vs 20).
I'd assume a real therapy session has far more "turns" than 20-40, and if model performance starts low and gets lower as conversations get longer, it's reasonable to expect it would be worse than a human (who typically doesn't become increasingly unhinged the longer you talk to them).
> Betterhelp is a nightmare for clients and therapists alike. Their only mission seems to be in making as much money as possible for their shareholders. Otherwise they don't seem at all interested in actually helping anyone. Stay away from Betterhelp.
So taking it as a baseline would bias any experiment against human therapists.
Yes, it absolutely does matter. Look at what you write:
> I'd assume
> it's reasonable to expect
The whole reason to do a study is to actually study as opposed to assume and expect.
And for many of the kinds of people engaging in therapy with an LLM, BetterHelp is precisely where they are most likely to go due to its marketing, convenience, and price. It's where a ton of real therapy is happening today. Most people do not have a $300/hr. high-quality therapist nearby who is available and whom they can afford. LLMs need to be compared, first, to the alternatives that are readily available.
And remember that all therapists on BetterHelp are licensed, with a master's or doctorate, and meet state board requirements. So I don't understand why that wouldn't be a perfectly reasonable baseline.
> I love how the top comment on that Reddit post is an affiliate link to an online therapy provider.
Posted 6 months after the post and all the rest of the comments. It's some kind of SEO manipulation. That Reddit thread ranked highly in my Google search about BetterHelp being bad, so they're probably trying to piggyback on it.
I'm not against affiliate links; I'm just pro-disclosure, especially for something as important as therapy, and it seems like maybe you should mention you make $150 for each person who signs up.