When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
- URL: http://arxiv.org/abs/2510.19032v1
- Date: Tue, 21 Oct 2025 19:21:21 GMT
- Title: When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
- Authors: Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, Elham Dolatabadi
- Abstract summary: MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance.
- Score: 14.24379104658635
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale and reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real-scenario datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into the Cognitive Support Score (CSS) and the Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and code at: https://github.com/abeerbadawi/MentalBench/
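As a rough illustration of the kind of agreement analysis the abstract describes, the sketch below estimates intraclass correlation coefficients (ICC) with 95% confidence intervals between one LLM judge and one human expert on a single attribute. This is a minimal sketch using the pingouin library, not the authors' released code; the column names and toy ratings are hypothetical.

```python
# Illustrative sketch (not the authors' released code): estimating
# LLM-judge vs. human-expert agreement on one attribute with an
# intraclass correlation coefficient (ICC) and its 95% CI.
# Column names ("response_id", "rater", "empathy") are hypothetical.
import pandas as pd
import pingouin as pg

# Long-format ratings: each response scored on one attribute (e.g. empathy)
# by a human expert and by an LLM judge.
ratings = pd.DataFrame({
    "response_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":       ["human", "llm"] * 5,
    "empathy":     [4, 5, 3, 4, 5, 5, 2, 3, 4, 5],
})

# Two-way ICC estimates with 95% confidence intervals.
icc = pg.intraclass_corr(
    data=ratings, targets="response_id", raters="rater", ratings="empathy"
)
print(icc.set_index("Type").loc[["ICC2", "ICC3"], ["ICC", "CI95%"]])

# Mean (llm - human) score difference as a direct estimate of judge bias.
bias = (ratings.pivot(index="response_id", columns="rater", values="empathy")
               .assign(diff=lambda d: d["llm"] - d["human"])["diff"].mean())
print(f"Mean LLM-human score difference: {bias:+.2f}")
```

Comparing the absolute-agreement ICC (ICC2) against the consistency ICC (ICC3) helps separate systematic score inflation by the judge from random disagreement, which is the distinction the abstract's bias analysis targets.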
Related papers
- On the Reliability of User-Centric Evaluation of Conversational Recommender Systems [0.9112926574395824]
We present a large-scale empirical study investigating the reliability of user-centric CRS evaluation on static dialogue transcripts. We collect 1,053 annotations from 124 crowd workers on 200 ReDial dialogues using the 18-dimensional CRS-Que framework. Our results show that utilitarian and outcome-oriented dimensions such as accuracy, usefulness, and satisfaction achieve moderate reliability under aggregation. Many dimensions collapse into a single global quality signal, revealing a strong halo effect in third-party judgments.
arXiv Detail & Related papers (2026-02-19T11:10:11Z) - Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation [14.243791046586347]
The escalating global mental health crisis, marked by persistent treatment gaps, limited availability of care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue.
arXiv Detail & Related papers (2026-01-26T16:04:19Z) - Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems [2.9141470183751674]
We propose a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines.
arXiv Detail & Related papers (2025-12-01T15:26:20Z) - MindEval: Benchmarking Language Models on Multi-turn Mental Health Support [10.524387723320432]
MindEval is a framework for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. We quantitatively validate the realism of our simulated patients against human-generated text and demonstrate strong correlations between automatic and human expert judgments. We evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication.
arXiv Detail & Related papers (2025-11-23T15:19:29Z) - Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs [6.0460961868478975]
We introduce a unified taxonomy of six clinically-informed mental health crisis categories. We benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. We identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context.
arXiv Detail & Related papers (2025-09-29T14:42:23Z) - TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them [58.04324690859212]
The use of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: Score-Comparison Inconsistency and Pairwise Transitivity Inconsistency. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations.
arXiv Detail & Related papers (2025-09-25T13:04:29Z) - Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge [28.534625907655776]
PsyCrisis-Bench is a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. We present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress.
arXiv Detail & Related papers (2025-08-11T17:52:07Z) - LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z) - Real-World Summarization: When Evaluation Reaches Its Limits [1.4197924572122094]
We compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments. Our analysis of real-world business impacts shows that incorrect and non-checkable information pose the greatest risks.
arXiv Detail & Related papers (2025-07-15T17:23:56Z) - Are LLM-generated plain language summaries truly understandable? A large-scale crowdsourced evaluation [7.867257950096845]
Plain language summaries (PLSs) are essential for facilitating effective communication between clinicians and patients. Large language models (LLMs) have recently shown promise in automating PLS generation, but their effectiveness in supporting health information comprehension remains unclear. We conducted a large-scale crowdsourced evaluation of LLM-generated PLSs using Amazon Mechanical Turk with 150 participants. Our findings indicate that while LLMs can generate PLSs that appear indistinguishable from human-written ones in subjective evaluations, human-written PLSs lead to significantly better comprehension.
arXiv Detail & Related papers (2025-05-15T15:31:17Z) - Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of LLM-based psychological counselors. Psi-Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients. Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance [73.19687314438133]
We study how reliance is affected by contextual features of an interaction.
We find that contextual characteristics significantly affect human reliance behavior.
Our results show that calibration and language quality alone are insufficient in evaluating the risks of human-LM interactions.
arXiv Detail & Related papers (2024-07-10T18:00:05Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z)