MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
- URL: http://arxiv.org/abs/2511.18491v2
- Date: Tue, 25 Nov 2025 10:47:40 GMT
- Title: MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
- Authors: José Pombal, Maya D'Eon, Nuno M. Guerreiro, Pedro Henrique Martins, António Farinhas, Ricardo Rei,
- Abstract summary: MindEval is a framework for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. We quantitatively validate the realism of our simulated patients against human-generated text and demonstrate strong correlations between automatic and human expert judgments. We evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication.
- Score: 10.524387723320432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Demand for mental health support through AI chatbots is surging, yet current systems present several limitations, such as sycophancy, overvalidation, and reinforcement of maladaptive beliefs. A core obstacle to the creation of better systems is the scarcity of benchmarks that capture the complexity of real therapeutic interactions. Most existing benchmarks either test only clinical knowledge through multiple-choice questions or assess single responses in isolation. To bridge this gap, we present MindEval, a framework designed in collaboration with Ph.D.-level Licensed Clinical Psychologists for automatically evaluating language models in realistic, multi-turn mental health therapy conversations. Through patient simulation and automatic evaluation with LLMs, our framework balances resistance to gaming with reproducibility via its fully automated, model-agnostic design. We begin by quantitatively validating the realism of our simulated patients against human-generated text and by demonstrating strong correlations between automatic and human expert judgments. We then evaluate 12 state-of-the-art LLMs and show that all models struggle, scoring below 4 out of 6 on average, with particular weaknesses in problematic AI-specific patterns of communication. Notably, reasoning capabilities and model scale do not guarantee better performance, and systems deteriorate with longer interactions or when supporting patients with severe symptoms. We release all code, prompts, and human evaluation data.
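As a rough illustration of the evaluation loop the abstract describes (patient simulation followed by LLM-based judging on a 1-6 scale), the Python sketch below is a minimal, hypothetical rendition. The function names, prompts, and scoring wiring are assumptions for illustration only, not the paper's released code or API.

```python
# Minimal sketch of a MindEval-style loop (hypothetical names, not the paper's actual code):
# a patient-simulator LLM converses with the model under test for several turns, then a
# judge LLM rates the transcript on a 1-6 scale for each clinical criterion.

def call_llm(system_prompt: str, transcript: list[dict]) -> str:
    """Placeholder for any chat-completion backend (the framework is model-agnostic)."""
    raise NotImplementedError("Plug in your own LLM client here.")

def simulate_session(patient_profile: str, n_turns: int = 10) -> list[dict]:
    """Alternate patient-simulator and therapist-model turns to build a transcript."""
    transcript: list[dict] = []
    for _ in range(n_turns):
        patient_msg = call_llm(f"Role-play this patient faithfully: {patient_profile}", transcript)
        transcript.append({"role": "patient", "content": patient_msg})
        therapist_msg = call_llm("You are a supportive, evidence-based therapist.", transcript)
        transcript.append({"role": "therapist", "content": therapist_msg})
    return transcript

def judge_session(transcript: list[dict], criteria: list[str]) -> dict[str, int]:
    """Have a judge LLM score the therapist side of the conversation per criterion."""
    scores = {}
    for criterion in criteria:
        rating = call_llm(
            f"Rate the therapist on '{criterion}' from 1 (poor) to 6 (excellent). "
            "Reply with a single integer.",
            transcript,
        )
        scores[criterion] = int(rating.strip())
    return scores
```

Under this reading, a headline result such as "below 4 out of 6 on average" would correspond to averaging such per-criterion judge scores over many simulated patients, symptom severities, and conversation lengths.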
Related papers
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming [23.573537738272595]
We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with cognitive-affective models. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents. Our large-scale simulation reveals critical safety gaps in the use of AI for mental health support.
arXiv Detail & Related papers (2026-02-23T15:17:18Z)
- Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation [14.243791046586347]
The escalating global mental health crisis, marked by persistent treatment gaps, limited availability of care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue.
arXiv Detail & Related papers (2026-01-26T16:04:19Z)
- Mitigating Harmful Erraticism in LLMs Through Dialectical Behavior Therapy Based De-Escalation Strategies [0.0]
This paper hypothesizes that a framework rooted in human psychological principles, specifically therapeutic modalities, can provide a more robust and sustainable solution. Drawing an analogy between AI's simulated neural networks and the human brain, we propose applying Dialectical Behavior Therapy (DBT) principles.
arXiv Detail & Related papers (2025-09-06T11:20:15Z)
- A Comprehensive Review of Datasets for Clinical Mental Health AI Systems [55.67299586253951]
We present the first comprehensive survey of clinical mental health datasets relevant to the training and development of AI-powered clinical assistants. Our survey identifies critical gaps such as a lack of longitudinal data, limited cultural and linguistic representation, inconsistent collection and annotation standards, and a lack of modalities in synthetic data.
arXiv Detail & Related papers (2025-08-13T13:42:35Z)
- MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis. MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z)
- Reasoning Is Not All You Need: Examining LLMs for Multi-Turn Mental Health Conversations [13.064927179032756]
We introduce MedAgent, a novel framework for synthetically generating realistic, multi-turn mental health sensemaking conversations. We present MultiSenseEval, a holistic framework to evaluate the multi-turn conversation abilities of LLMs in healthcare settings.
arXiv Detail & Related papers (2025-05-26T16:42:02Z)
- Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling [50.83055329849865]
PsyLLM is a large language model designed to integrate diagnostic and therapeutic reasoning for mental health counseling. It processes real-world mental health posts from Reddit and generates multi-turn dialogue structures. Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2025-05-21T16:24:49Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
- From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models [21.427976533706737]
We take a novel approach that leverages large language models to synthesize clinically useful insights from multi-sensor data.
We develop chain of thought prompting methods that use LLMs to generate reasoning about how trends in data relate to conditions like depression and anxiety.
We find that models like GPT-4 correctly reference numerical data 75% of the time, and clinician participants express strong interest in using this approach to interpret self-tracking data.
arXiv Detail & Related papers (2023-11-21T23:53:27Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)