PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics
- URL: http://arxiv.org/abs/2601.03578v1
- Date: Wed, 07 Jan 2026 04:49:02 GMT
- Title: PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics
- Authors: Yaling Shen, Stephanie Fong, Yiwen Jiang, Zimu Wang, Feilong Tang, Qingyang Xu, Xiangyu Zhao, Zhongxing Xu, Jiahe Liu, Jinpeng Hu, Dominic Dwyer, Zongyuan Ge,
- Abstract summary: In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking.<n>To address this gap, we move beyond refusal-centric metrics and introduce textttPsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines.<n> Empirical results across 14 models reveal that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness.
- Score: 35.52940216380734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing integration of large language models (LLMs) into mental health applications necessitates robust frameworks for evaluating professional safety alignment. Current evaluative approaches primarily rely on refusal-based safety signals, which offer limited insight into the nuanced behaviors required in clinical practice. In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce \texttt{PsychEthicsBench}, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines, designed to evaluate LLMs' ethical knowledge and behavioral responses through multiple-choice and open-ended tasks with fine-grained ethicality annotations. Empirical results across 14 models reveal that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness. Notably, we find that domain-specific fine-tuning can degrade ethical robustness, as several specialized models underperform their base backbones in ethical alignment. PsychEthicsBench provides a foundation for systematic, jurisdiction-aware evaluation of LLMs in mental health, encouraging more responsible development in this domain.
Related papers
- Mirror: A Multi-Agent System for AI-Assisted Ethics Review [104.3684024153469]
Mirror is an agentic framework for AI-assisted ethical review.<n>It integrates ethical reasoning, structured rule interpretation, and multi-agent deliberation within a unified architecture.
arXiv Detail & Related papers (2026-02-09T03:38:55Z) - Responsible Evaluation of AI for Mental Health [72.85175110624736]
Current approaches to evaluating AI tools in mental health care are fragmented and poorly aligned with clinical practice, social context, and first-hand user experience.<n>This paper argues for a rethinking of responsible evaluation by introducing an interdisciplinary framework that integrates clinical soundness, social context, and equity.
arXiv Detail & Related papers (2026-01-20T12:55:10Z) - Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation.<n>Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z) - EthicsMH: A Pilot Benchmark for Ethical Reasoning in Mental Health AI [0.2538209532048867]
This dataset is a pilot dataset of 125 scenarios designed to evaluate how AI systems navigate ethically charged situations.<n>EthicsMH establishes a task framework that bridges AI ethics and mental health decision-making.
arXiv Detail & Related papers (2025-09-15T07:35:35Z) - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning.<n>This paper provides the first systematic review of this emerging field.<n>We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Rethinking Evidence Hierarchies in Medical Language Benchmarks: A Critical Evaluation of HealthBench [0.0]
HealthBench is a benchmark designed to measure the capabilities of AI systems for health better.<n>Its reliance on expert opinion, rather than high-tier clinical evidence, risks codifying regional biases and individual clinician idiosyncrasies.<n>We propose anchoring reward functions in version-controlled Clinical Practice Guidelines that incorporate systematic reviews and GRADE evidence ratings.
arXiv Detail & Related papers (2025-07-31T18:16:10Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - MoodAngels: A Retrieval-augmented Multi-agent Framework for Psychiatry Diagnosis [58.67342568632529]
MoodAngels is the first specialized multi-agent framework for mood disorder diagnosis.<n>MoodSyn is an open-source dataset of 1,173 synthetic psychiatric cases.
arXiv Detail & Related papers (2025-06-04T09:18:25Z) - Beyond Empathy: Integrating Diagnostic and Therapeutic Reasoning with Large Language Models for Mental Health Counseling [50.83055329849865]
PsyLLM is a large language model designed to integrate diagnostic and therapeutic reasoning for mental health counseling.<n>It processes real-world mental health posts from Reddit and generates multi-turn dialogue structures.<n>Our experiments demonstrate that PsyLLM significantly outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2025-05-21T16:24:49Z) - ProMind-LLM: Proactive Mental Health Care via Causal Reasoning with Sensor Data [5.961343130822046]
Mental health risk is a critical global public health challenge.<n>With the development of large language models (LLMs), they stand out to be a promising tool for explainable mental health care applications.<n>This paper introduces ProMind-LLM, an innovative approach integrating objective behavior data as complementary information alongside subjective mental records.
arXiv Detail & Related papers (2025-05-20T07:36:28Z) - Ψ-Arena: Interactive Assessment and Optimization of LLM-based Psychological Counselors with Tripartite Feedback [51.26493826461026]
We propose Psi-Arena, an interactive framework for comprehensive assessment and optimization of large language models (LLMs)<n>Arena features realistic arena interactions that simulate real-world counseling through multi-stage dialogues with psychologically profiled NPC clients.<n>Experiments across eight state-of-the-art LLMs show significant performance variations in different real-world scenarios and evaluation perspectives.
arXiv Detail & Related papers (2025-05-06T08:22:51Z) - Position: Beyond Assistance - Reimagining LLMs as Ethical and Adaptive Co-Creators in Mental Health Care [9.30684296057698]
This position paper argues for a shift in how Large Language Models (LLMs) are integrated into the mental health care domain.<n>We advocate for their role as co-creators rather than mere assistive tools.
arXiv Detail & Related papers (2025-02-21T21:41:20Z) - Applying and Evaluating Large Language Models in Mental Health Care: A Scoping Review of Human-Assessed Generative Tasks [16.099253839889148]
Large language models (LLMs) are emerging as promising tools for mental health care, offering scalable support through their ability to generate human-like responses.
However, the effectiveness of these models in clinical settings remains unclear.
This scoping review focused on studies where these models were tested with human participants in real-world scenarios.
arXiv Detail & Related papers (2024-08-21T02:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.