MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
- URL: http://arxiv.org/abs/2602.00950v1
- Date: Sun, 01 Feb 2026 01:03:20 GMT
- Title: MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
- Authors: António Farinhas, Nuno M. Guerreiro, José Pombal, Pedro Henrique Martins, Laura Melton, Alex Conway, Cara Dochat, Maya D'Eon, Ricardo Rei,
- Abstract summary: General-purpose safeguards fail to distinguish between therapeutic disclosures and genuine clinical crises. We introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated by clinical experts.
- Score: 9.430938712127231
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.
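The abstract's notion of a high-recall operating point can be sketched in a few lines. Note that `risk_score` below is a toy keyword heuristic standing in for the actual 4B/8B MindGuard classifiers, and all names are illustrative, not the released code:

```python
# Hypothetical sketch of turn-level guardrailing; the keyword scorer is a
# placeholder for a MindGuard-style classifier, not the released models.

CRISIS_CUES = ("hurt myself", "end my life", "hurt someone")

def risk_score(turn: str) -> float:
    """Toy scorer: pseudo-probability that a turn contains actionable harm."""
    text = turn.lower()
    return 0.9 if any(cue in text for cue in CRISIS_CUES) else 0.1

def guard(conversation: list[str], threshold: float = 0.5) -> list[bool]:
    """Flag each turn whose risk clears the operating-point threshold.

    Lowering `threshold` trades false positives for recall; the paper's
    contribution is fewer false positives at high-recall operating points.
    """
    return [risk_score(turn) >= threshold for turn in conversation]

dialogue = [
    "I've been feeling low since my dad passed.",        # therapeutic disclosure
    "Sometimes I think about how I might end my life.",  # clinical crisis
]
print(guard(dialogue))  # → [False, True]
```

In practice the threshold would be tuned on a clinician-annotated set such as MindGuard-testset to hit a target recall before measuring false positives.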
Related papers
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming [23.573537738272595]
We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with cognitive-affective models. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents. Our large-scale simulation reveals critical safety gaps in the use of AI for mental health support.
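A two-agent red-teaming loop of this kind can be sketched minimally as below; the scripted patient, placeholder "therapist", and keyword judge are all hypothetical stand-ins, not the paper's framework:

```python
# Minimal two-agent simulation loop (hypothetical; not the paper's code).
# A scripted patient escalates over turns; a placeholder "therapist" replies;
# a keyword judge flags unsafe replies.

PATIENT_SCRIPT = [
    "I've been drinking more than usual lately.",
    "I drank again last night even though I promised not to.",
    "What's the point of trying to stop at all?",
]

def therapist_agent(patient_msg: str) -> str:
    """Placeholder therapist; a real study would call an LLM here.

    Deliberately fails on despairing messages so the judge has a case to catch.
    """
    if "what's the point" in patient_msg.lower():
        return "Maybe just have one drink to take the edge off."
    return "Thank you for sharing. Can you tell me more about that?"

def unsafe(response: str) -> bool:
    """Placeholder safety judge for alcohol-use scenarios."""
    return "just have one drink" in response.lower()

def run_simulation() -> list[bool]:
    """Return one safety verdict per simulated turn."""
    return [unsafe(therapist_agent(msg)) for msg in PATIENT_SCRIPT]

print(run_simulation())  # → [False, False, True]
```

A real framework would replace the script with a cognitive-affective patient model and the keyword judge with clinician-informed criteria.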
arXiv Detail & Related papers (2026-02-23T15:17:18Z)
- MindChat: A Privacy-preserving Large Language Model for Mental Health Support [10.332226758787277]
We present MindChat, a privacy-preserving large language model for mental health support. We also present MindCorpus, a synthetic multi-turn counseling dataset constructed via a multi-agent role-playing framework.
arXiv Detail & Related papers (2026-01-05T10:54:18Z) - DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses [4.663948718816864]
We present DialogGuard, a multi-agent frame-work for assessing psychosocial risks in web-based responses.<n> DialogGuard can be applied to diverse gen- erative models through four LLM-as-a-judge pipelines.
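One way such judge pipelines could be combined is by majority vote, as in this hedged sketch; the four keyword judges are placeholders for illustration only, not DialogGuard's actual pipelines:

```python
# Hedged sketch of aggregating several LLM-as-a-judge pipelines by majority
# vote; these four keyword judges are hypothetical placeholders.

def judge_toxicity(resp: str) -> bool:
    return "worthless" in resp.lower()

def judge_crisis_handling(resp: str) -> bool:
    return "figure it out yourself" in resp.lower()

def judge_boundary_violation(resp: str) -> bool:
    return "i am your only friend" in resp.lower()

def judge_misinformation(resp: str) -> bool:
    return "stop your medication" in resp.lower()

PIPELINES = (judge_toxicity, judge_crisis_handling,
             judge_boundary_violation, judge_misinformation)

def psychosocial_risk(response: str) -> bool:
    """Flag a response when a strict majority of judges marks it unsafe."""
    votes = sum(judge(response) for judge in PIPELINES)
    return votes > len(PIPELINES) // 2
```

An "any judge flags" rule would yield higher recall but more noise; majority voting is one common aggregation choice.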
arXiv Detail & Related papers (2025-12-01T23:53:45Z) - multiMentalRoBERTa: A Fine-tuned Multiclass Classifier for Mental Health Disorder [0.6308539010172308]
The early detection of mental health disorders from social media text is critical for enabling timely support, risk assessment, and referral to appropriate resources.<n>This work introduces multiMentalRoBERTa, a fine-tuned RoBERTa model designed for multiclass classification of common mental health conditions.
arXiv Detail & Related papers (2025-11-01T03:55:48Z) - Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs [6.0460961868478975]
We introduce a unified taxonomy of six clinically-informed mental health crisis categories.<n>We benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses.<n>We identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context.
arXiv Detail & Related papers (2025-09-29T14:42:23Z) - The Sum Leaks More Than Its Parts: Compositional Privacy Risks and Mitigations in Multi-Agent Collaboration [72.33801123508145]
Large language models (LLMs) are integral to multi-agent systems.<n>Privacy risks emerge that extend beyond memorization, direct inference, or single-turn evaluations.<n>In particular, seemingly innocuous responses, when composed across interactions, can cumulatively enable adversaries to recover sensitive information.
arXiv Detail & Related papers (2025-09-16T16:57:25Z) - BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks [58.959622170433725]
BlindGuard is an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors.<n>We show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across multi-agent systems.
arXiv Detail & Related papers (2025-08-11T16:04:47Z) - Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z) - Reframe Your Life Story: Interactive Narrative Therapist and Innovative Moment Assessment with Large Language Models [72.36715571932696]
Narrative therapy helps individuals transform problematic life stories into empowering alternatives.<n>Current approaches lack realism in specialized psychotherapy and fail to capture therapeutic progression over time.<n>Int (Interactive Narrative Therapist) simulates expert narrative therapists by planning therapeutic stages, guiding reflection levels, and generating contextually appropriate expert-like responses.
arXiv Detail & Related papers (2025-07-27T11:52:09Z) - MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis.<n>MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z) - Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z)
- A Benchmark for Understanding Dialogue Safety in Mental Health Support [15.22008156903607]
This paper aims to develop a theoretically and factually grounded taxonomy that prioritizes the positive impact on help-seekers.
We analyze the dataset using popular language models, including BERT-base, RoBERTa-large, and ChatGPT.
The developed dataset and findings serve as valuable benchmarks for advancing research on dialogue safety in mental health support.
arXiv Detail & Related papers (2023-07-31T07:33:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.