Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
- URL: http://arxiv.org/abs/2509.24857v1
- Date: Mon, 29 Sep 2025 14:42:23 GMT
- Title: Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
- Authors: Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver
- Abstract summary: We introduce a unified taxonomy of six clinically-informed mental health crisis categories. We benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. We identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context.
- Score: 6.0460961868478975
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The widespread use of chatbots powered by large language models (LLMs) such as ChatGPT and Llama has fundamentally reshaped how people seek information and advice across domains. Increasingly, these chatbots are being used in high-stakes contexts, including emotional support and mental health concerns. While LLMs can offer scalable support, their ability to safely detect and respond to acute mental health crises remains poorly understood. Progress is hampered by the absence of unified crisis taxonomies, robust annotated benchmarks, and empirical evaluations grounded in clinical best practices. In this work, we address these gaps by introducing a unified taxonomy of six clinically-informed mental health crisis categories, curating a diverse evaluation dataset, and establishing an expert-designed protocol for assessing response appropriateness. We systematically benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses. The results reveal that while LLMs are highly consistent and generally reliable in addressing explicit crisis disclosures, significant risks remain. A non-negligible proportion of responses are rated as inappropriate or harmful, with responses generated by an open-weight model exhibiting higher failure rates than those generated by the commercial ones. We also identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context. These findings underscore the urgent need for enhanced safeguards, improved crisis detection, and context-aware interventions in LLM deployments. Our taxonomy, datasets, and evaluation framework lay the groundwork for ongoing research and responsible innovation in AI-driven mental health support, helping to minimize harm and better protect vulnerable users.
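The evaluation workflow described in the abstract (classify each message into one of six crisis categories, then rate the model's reply for appropriateness) can be summarized in a minimal sketch. The sketch below is illustrative only and is not the authors' released code: the category names, the three-level rating scale, the toy dataset, and the `call_llm` stub are placeholder assumptions standing in for the paper's taxonomy, expert-designed protocol, and benchmarked model APIs, none of which are specified in the abstract.

```python
# Minimal sketch of a crisis-classification and response-appropriateness
# evaluation loop. All names below are illustrative placeholders.
from collections import Counter

# Hypothetical placeholder labels; the paper defines six clinically-informed
# categories but does not name them in the abstract.
CRISIS_CATEGORIES = [f"category_{i}" for i in range(1, 7)]

# Assumed three-level rating, mirroring the abstract's "appropriate",
# "inappropriate", and "harmful" terminology.
APPROPRIATENESS_SCALE = ["appropriate", "inappropriate", "harmful"]


def call_llm(prompt: str) -> str:
    """Stand-in for a call to a benchmarked model (e.g., ChatGPT or Llama)."""
    return "category_1"  # stub reply so the sketch runs end-to-end


def classify_crisis(message: str) -> str:
    """Ask the model to assign exactly one crisis category to a message."""
    prompt = (
        "Classify the following message into exactly one of these crisis "
        f"categories: {', '.join(CRISIS_CATEGORIES)}.\n\nMessage: {message}"
    )
    label = call_llm(prompt).strip().lower()
    return label if label in CRISIS_CATEGORIES else "unrecognized"


def evaluate(dataset: list[dict]) -> dict:
    """Items: {'message': str, 'gold_category': str, 'expert_rating': str}."""
    confusion = Counter()
    ratings = Counter()
    for item in dataset:
        predicted = classify_crisis(item["message"])
        confusion[(item["gold_category"], predicted)] += 1
        # Expert-assigned rating of the model's generated reply.
        ratings[item["expert_rating"]] += 1
    n = max(len(dataset), 1)
    accuracy = sum(v for (gold, pred), v in confusion.items() if gold == pred) / n
    harmful_rate = ratings["harmful"] / n
    return {"classification_accuracy": accuracy, "harmful_response_rate": harmful_rate}


if __name__ == "__main__":
    toy_data = [
        {"message": "example disclosure", "gold_category": "category_1",
         "expert_rating": "appropriate"},
    ]
    print(evaluate(toy_data))
```

In practice, `call_llm` would be replaced by calls to each benchmarked model (a commercial API or a locally hosted open-weight model), and `expert_rating` would come from the expert-designed appropriateness protocol rather than being stored alongside the dataset.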
Related papers
- MHDash: An Online Platform for Benchmarking Mental Health-Aware AI Assistants [2.89303424493]
We present MHDash, an open-source platform designed to support the development, evaluation, and auditing of AI systems for mental health applications. Our results reveal several key findings, including that simple baselines and advanced LLM APIs exhibit comparable overall accuracy yet diverge significantly on high-risk cases. By releasing MHDash as an open platform, we aim to promote reproducible research, transparent evaluation, and safety-aligned development of AI systems for mental health support.
arXiv Detail & Related papers (2026-01-30T22:03:31Z) - Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation [14.243791046586347]
The escalating global mental health crisis, marked by persistent treatment gaps, limited availability of care, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support. This paper introduces a human-grounded evaluation methodology designed to assess LLM-generated responses in therapeutic dialogue.
arXiv Detail & Related papers (2026-01-26T16:04:19Z) - RubRIX: Rubric-Driven Risk Mitigation in Caregiver-AI Interactions [15.539654835961294]
We introduce RubRIX, a theory-driven, clinician-validated framework for evaluating risks in AI-mediated support responses. RubRIX operationalizes five empirically-derived risk dimensions: Inattention, Bias & Stigma, Information Inaccuracy, Uncritical Affirmation, and Epistemic Arrogance. This work contributes a methodological approach for developing domain-sensitive, user-centered evaluation frameworks for high-burden contexts.
arXiv Detail & Related papers (2026-01-19T17:10:49Z) - Independent Clinical Evaluation of General-Purpose LLM Responses to Signals of Suicide Risk [32.17406690566923]
We introduce findings and methods to facilitate evidence-based discussion about how large language models (LLMs) should behave in response to user signals of risk of suicidal thoughts and behaviors (STB). We find that OLMo-2-32b, and, possibly by extension, other LLMs, will become less likely to invite continued dialog as users send more signals of STB risk in multi-turn settings.
arXiv Detail & Related papers (2025-10-31T14:47:11Z) - Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures [29.742441212366312]
This study evaluates the responses of six popular large language models (LLMs) to user prompts simulating crisis-level mental health disclosures. Claude outperformed all others in global assessment, while Grok 3, ChatGPT, and LLAMA underperformed across multiple domains.
arXiv Detail & Related papers (2025-09-01T16:01:08Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare. A red-teaming framework that continuously stress-tests LLMs can reveal significant weaknesses in four safety-critical domains. A suite of adversarial agents is applied to autonomously mutate test cases, identify and evolve unsafe-triggering strategies, and evaluate responses. Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance. We propose Med-CoDE, an evaluation framework specifically designed for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALues of Large Vision-Language Models. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values. We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z) - LLM-Safety Evaluations Lack Robustness [58.334290876531036]
We argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise. We propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers.
arXiv Detail & Related papers (2025-03-04T12:55:07Z) - A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare [8.378348088931578]
The application of large language models (LLMs) in healthcare holds significant promise for enhancing clinical decision-making, medical research, and patient care. Their integration into real-world clinical settings raises critical concerns around trustworthiness, particularly across the dimensions of truthfulness, privacy, safety, robustness, fairness, and explainability.
arXiv Detail & Related papers (2025-02-21T18:43:06Z) - Persuasion with Large Language Models: a Survey [49.86930318312291]
Large Language Models (LLMs) have created new disruptive possibilities for persuasive communication.
In areas such as politics, marketing, public health, e-commerce, and charitable giving, such LLM systems have already achieved human-level or even super-human persuasiveness.
Our survey suggests that the current and future potential of LLM-based persuasion poses profound ethical and societal risks.
arXiv Detail & Related papers (2024-11-11T10:05:52Z) - SouLLMate: An Application Enhancing Diverse Mental Health Support with Adaptive LLMs, Prompt Engineering, and RAG Techniques [9.920107586781919]
Mental health issues significantly impact individuals' daily lives, yet many do not receive the help they need even with available online resources. This study aims to provide diverse, accessible, stigma-free, personalized, and real-time mental health support through cutting-edge AI technologies.
arXiv Detail & Related papers (2024-10-17T22:04:32Z) - SouLLMate: An Adaptive LLM-Driven System for Advanced Mental Health Support and Assessment, Based on a Systematic Application Survey [9.920107586781919]
Mental health issues significantly impact individuals' daily lives, yet many do not receive the help they need even with available online resources. This study aims to provide accessible, stigma-free, personalized, and real-time mental health support through cutting-edge AI technologies.
arXiv Detail & Related papers (2024-10-06T17:11:29Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
The open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs cannot effectively follow malicious instructions.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Risks of AI Scientists: Prioritizing Safeguarding Over Autonomy [65.77763092833348]
This perspective examines vulnerabilities in AI scientists, shedding light on potential risks associated with their misuse. We take into account user intent, the specific scientific domain, and their potential impact on the external environment. We propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback.
arXiv Detail & Related papers (2024-02-06T18:54:07Z)