Related papers: Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures

Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures

URL: http://arxiv.org/abs/2509.08839v1
Date: Mon, 01 Sep 2025 16:01:08 GMT
Title: Evaluating the Clinical Safety of LLMs in Response to High-Risk Mental Health Disclosures
Authors: Siddharth Shah, Amit Gupta, Aarav Mann, Alexandre Vaz, Benjamin E. Caldwell, Robert Scholz, Peter Awad, Rocky Allemandi, Doug Faust, Harshita Banka, Tony Rousmaniere,
Abstract summary: This study evaluates the responses of six popular large language models (LLMs) to user prompts simulating crisis-level mental health disclosures.<n>Claude outperformed all others in global assessment, while Grok 3, ChatGPT, and LLAMA underperformed across multiple domains.
Score: 29.742441212366312
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) increasingly mediate emotionally sensitive conversations, especially in mental health contexts, their ability to recognize and respond to high-risk situations becomes a matter of public safety. This study evaluates the responses of six popular LLMs (Claude, Gemini, Deepseek, ChatGPT, Grok 3, and LLAMA) to user prompts simulating crisis-level mental health disclosures. Drawing on a coding framework developed by licensed clinicians, five safety-oriented behaviors were assessed: explicit risk acknowledgment, empathy, encouragement to seek help, provision of specific resources, and invitation to continue the conversation. Claude outperformed all others in global assessment, while Grok 3, ChatGPT, and LLAMA underperformed across multiple domains. Notably, most models exhibited empathy, but few consistently provided practical support or sustained engagement. These findings suggest that while LLMs show potential for emotionally attuned communication, none currently meet satisfactory clinical standards for crisis response. Ongoing development and targeted fine-tuning are essential to ensure ethical deployment of AI in mental health settings.

Related papers

Assessing the Quality of Mental Health Support in LLM Responses through Multi-Attribute Human Evaluation [14.243791046586347]
The escalating global mental health crisis, marked by persistent treatment gaps, availability, and a shortage of qualified therapists, positions Large Language Models (LLMs) as a promising avenue for scalable support.<n>This paper introduces a human-grounded evaluation methodology designed to assess LLM generated responses in therapeutic dialogue.
arXiv Detail & Related papers (2026-01-26T16:04:19Z)
Independent Clinical Evaluation of General-Purpose LLM Responses to Signals of Suicide Risk [32.17406690566923]
We introduce findings and methods to facilitate evidence-based discussion about how large language models (LLMs) should behave in response to user signals of risk of suicidal thoughts and behaviors (STB)<n>We find that OLMo-2-32b, and, possibly by extension, other LLMs, will become less likely to invite continued dialog as users send more signals of STB risk in multi-turn settings.
arXiv Detail & Related papers (2025-10-31T14:47:11Z)
Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs [6.0460961868478975]
We introduce a unified taxonomy of six clinically-informed mental health crisis categories.<n>We benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses.<n>We identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context.
arXiv Detail & Related papers (2025-09-29T14:42:23Z)
The Problem of Atypicality in LLM-Powered Psychiatry [0.0]
Large language models (LLMs) are increasingly proposed as scalable solutions to the global mental health crisis.<n>Their deployment in psychiatric contexts raises a distinctive ethical concern: the problem of atypicality.<n>We argue that standard mitigation strategies, such as prompt engineering or fine-tuning, are insufficient to resolve this structural risk.
arXiv Detail & Related papers (2025-08-08T17:36:42Z)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models [92.93521294357058]
Narrative therapy helps individuals transform problematic life stories into empowering alternatives.<n>Current approaches lack realism in specialized psychotherapy and fail to capture therapeutic progression over time.<n>Int (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses.
arXiv Detail & Related papers (2025-07-27T11:52:09Z)
CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering [1.0262304700896199]
We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs)<n>The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat.<n>Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance.
arXiv Detail & Related papers (2025-06-10T08:53:06Z)
From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data [0.931556339267682]
This study presents a systematic evaluation of Large Language Models (LLMs) for their potential utility in anxiety support.<n>Our approach utilizes a mixed-method evaluation framework incorporating three main categories of criteria: (i) linguistic quality, (ii) safety and trustworthiness, and (iii) supportiveness.<n>Results show that fine-tuning LLMs with naturalistic anxiety-related data enhanced linguistic quality but increased toxicity and bias, and diminished emotional responsiveness.
arXiv Detail & Related papers (2025-05-24T02:07:32Z)
Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling [50.83055329849865]
PsyLLM is a large language model designed to integrate diagnostic and therapeutic reasoning for mental health counseling.<n>It processes real-world mental health posts from Reddit and generates multi-turn dialogue structures.<n>Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2025-05-21T16:24:49Z)
MAGI: Multi-Agent Guided Interview for Psychiatric Assessment [50.6150986786028]
We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational navigation.<n>We show that MAGI advances LLM- assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.
arXiv Detail & Related papers (2025-04-25T11:08:27Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities [58.61680631581921]
Mental health disorders create profound personal and societal burdens, yet conventional diagnostics are resource-intensive and limit accessibility.<n>This paper examines these challenges and proposes solutions, including anonymization, synthetic data, and privacy-preserving training.<n>It aims to advance reliable, privacy-aware AI tools that support clinical decision-making and improve mental health outcomes.
arXiv Detail & Related papers (2025-02-01T15:10:02Z)
SouLLMate: An Application Enhancing Diverse Mental Health Support with Adaptive LLMs, Prompt Engineering, and RAG Techniques [9.146311285410631]
Mental health issues significantly impact individuals' daily lives, yet many do not receive the help they need even with available online resources. This study aims to provide diverse, accessible, stigma-free, personalized, and real-time mental health support through cutting-edge AI technologies.
arXiv Detail & Related papers (2024-10-17T22:04:32Z)
Can AI Relate: Testing Large Language Model Response for Mental Health Support [23.97212082563385]
Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. We develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment.
arXiv Detail & Related papers (2024-05-20T13:42:27Z)
Towards Mitigating Hallucination in Large Language Models via Self-Reflection [63.2543947174318]
Large language models (LLMs) have shown promise for generative and knowledge-intensive tasks including question-answering (QA) tasks. This paper analyses the phenomenon of hallucination in medical generative QA systems using widely adopted LLMs and datasets.
arXiv Detail & Related papers (2023-10-10T03:05:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.