Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
- URL: http://arxiv.org/abs/2508.08236v1
- Date: Mon, 11 Aug 2025 17:52:07 GMT
- Title: Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
- Authors: Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan,
- Abstract summary: PsyCrisis-Bench is a reference-free evaluation benchmark based on real-world Chinese mental health dialogues.<n>It evaluates whether the model responses align with the safety principles defined by experts.<n>We present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress.
- Score: 28.534625907655776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
Related papers
- Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing [0.4018523696542335]
Learning from human feedback assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems.<n>We tested this assumption in mental health, where high safety stakes make expert consensus essential.<n>Suicide and self-harm responses produced greater divergence than any other category, and was systematic rather than random.
arXiv Detail & Related papers (2026-01-26T01:31:25Z) - Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience.<n>This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end.<n>We pioneer the incorporation of risk categories and levels into input design.<n>We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z) - Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs [6.0460961868478975]
We introduce a unified taxonomy of six clinically-informed mental health crisis categories.<n>We benchmark three state-of-the-art LLMs for their ability to classify crisis types and generate safe, appropriate responses.<n>We identify systemic weaknesses in handling indirect or ambiguous risk signals, a reliance on formulaic and inauthentic default replies, and frequent misalignment with user context.
arXiv Detail & Related papers (2025-09-29T14:42:23Z) - Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences.<n>For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z) - ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications.<n>We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z) - The Scales of Justitia: A Comprehensive Survey on Safety Evaluation of LLMs [42.57873562187369]
Large Language Models (LLMs) have demonstrated remarkable potential in the field of Natural Language Processing (NLP)<n>LLMs have occasionally exhibited unsafe elements like toxicity and bias, particularly in adversarial scenarios.<n>This survey aims to provide a comprehensive and systematic overview of recent advancements in LLMs safety evaluation.
arXiv Detail & Related papers (2025-06-06T05:50:50Z) - LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements [26.88382777632026]
HSE-Bench is the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of large language models.<n>It comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos.<n>We conduct evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models.
arXiv Detail & Related papers (2025-05-29T01:02:53Z) - Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs)<n>Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients.<n>Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.<n>Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.<n>We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories [14.605576275135522]
evaluating value alignment of large language models (LLMs) has traditionally relied on single-sentence adversarial prompts.<n>We propose an upgraded value alignment benchmark that moves beyond single-sentence prompts by incorporating multi-turn dialogues and narrative-based scenarios.
arXiv Detail & Related papers (2025-03-28T03:31:37Z) - Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges [34.10494503049667]
Large language models (LLMs) are increasingly applied to outpatient referral tasks across healthcare systems.<n>There is a lack of standardized evaluation criteria to assess their effectiveness.<n>We propose a comprehensive evaluation framework specifically designed for such systems.
arXiv Detail & Related papers (2025-03-11T11:05:42Z) - A Mixed-Methods Evaluation of LLM-Based Chatbots for Menopause [7.156867036177255]
The integration of Large Language Models (LLMs) into healthcare settings has gained significant attention.<n>We examine the performance of publicly available LLM-based chatbots for menopause-related queries.<n>Our findings highlight the promise and limitations of traditional evaluation metrics for sensitive health topics.
arXiv Detail & Related papers (2025-02-05T19:56:52Z) - INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance.
The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic.
We observed that our approach can provide an advanced level of critiques of both safety and helpfulness analysis, significantly improving the quality of output codes.
arXiv Detail & Related papers (2024-06-23T15:55:07Z) - CValues: Measuring the Values of Chinese Large Language Models from
Safety to Responsibility [62.74405775089802]
We present CValues, the first Chinese human values evaluation benchmark to measure the alignment ability of LLMs.
As a result, we have manually collected adversarial safety prompts across 10 scenarios and induced responsibility prompts from 8 domains.
Our findings suggest that while most Chinese LLMs perform well in terms of safety, there is considerable room for improvement in terms of responsibility.
arXiv Detail & Related papers (2023-07-19T01:22:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.