Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
- URL: http://arxiv.org/abs/2601.18061v2
- Date: Fri, 30 Jan 2026 18:45:34 GMT
- Title: Expert Evaluation and the Limits of Human Feedback in Mental Health AI Safety Testing
- Authors: Kiana Jafari, Paul Ulrich Nikolaus Rust, Duncan Eddy, Robbie Fraser, Nina Vasan, Darja Djordjevic, Akanksha Dadlani, Max Lamparth, Eugenia Kim, Mykel Kochenderfer
- Abstract summary: Learning from human feedback assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from human feedback (LHF) assumes that expert judgments, appropriately aggregated, yield valid ground truth for training and evaluating AI systems. We tested this assumption in mental health, where high safety stakes make expert consensus essential. Three certified psychiatrists independently evaluated LLM-generated responses using a calibrated rubric. Despite similar training and shared instructions, inter-rater reliability was consistently poor (ICC $0.087$--$0.295$), falling below thresholds considered acceptable for consequential assessment. Disagreement was highest on the most safety-critical items: suicide and self-harm responses produced greater divergence than any other category, and this divergence was systematic rather than random. One factor yielded negative reliability (Krippendorff's $\alpha = -0.203$), indicating structured disagreement worse than chance. Qualitative interviews revealed that disagreement reflects coherent but incompatible individual clinical frameworks (safety-first, engagement-centered, and culturally informed orientations) rather than measurement error. By demonstrating that experts rely on holistic risk heuristics rather than granular factor discrimination, these findings suggest that aggregated labels function as arithmetic compromises that effectively erase grounded professional philosophies. Our results characterize expert disagreement in safety-critical AI as a sociotechnical phenomenon in which professional experience introduces sophisticated layers of principled divergence. We discuss implications for reward modeling, safety classification, and evaluation benchmarks, and recommend that practitioners shift from consensus-based aggregation to alignment methods that preserve and learn from expert disagreement.
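For readers less familiar with the two agreement statistics cited in the abstract, the sketch below shows one common way to compute ICC(2,1) and Krippendorff's alpha (interval metric) from a units-by-raters score matrix. This is a minimal illustration, not the paper's pipeline: the rubric items, the exact ICC variant used in the study, the handling of missing ratings, and the toy scores here are all assumptions made for demonstration only.

```python
import numpy as np

def icc_2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: (n_subjects, n_raters) matrix of numeric scores, no missing values.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)
    col_means = ratings.mean(axis=0)

    ss_rows = k * np.sum((row_means - grand) ** 2)   # between-subjects sum of squares
    ss_cols = n * np.sum((col_means - grand) ** 2)   # between-raters sum of squares
    ss_total = np.sum((ratings - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_r = ss_rows / (n - 1)                 # between-subjects mean square
    ms_c = ss_cols / (k - 1)                 # between-raters mean square
    ms_e = ss_error / ((n - 1) * (k - 1))    # residual mean square

    # Shrout & Fleiss (1979) formula for ICC(2,1).
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

def krippendorff_alpha_interval(ratings: np.ndarray) -> float:
    """Krippendorff's alpha with an interval distance metric, complete data only."""
    n_units, n_raters = ratings.shape
    values = ratings.reshape(-1)

    # Observed disagreement: squared differences between raters within each unit.
    d_obs = 0.0
    for unit in ratings:
        diffs = unit[:, None] - unit[None, :]
        d_obs += np.sum(diffs ** 2)
    d_obs /= n_units * n_raters * (n_raters - 1)

    # Expected disagreement: squared differences over all value pairs, pooled across units.
    all_diffs = values[:, None] - values[None, :]
    d_exp = np.sum(all_diffs ** 2) / (values.size * (values.size - 1))

    return 1.0 - d_obs / d_exp

# Toy example: 3 raters scoring 5 responses on a 1-5 rubric item (illustrative data only).
scores = np.array([
    [2, 5, 3],
    [1, 4, 2],
    [4, 5, 1],
    [3, 2, 5],
    [2, 4, 4],
], dtype=float)
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
print(f"Krippendorff's alpha = {krippendorff_alpha_interval(scores):.3f}")
```

Negative alpha, as reported for one factor in the abstract, arises when within-unit disagreement exceeds the disagreement expected from pooling all scores, i.e. raters diverge in a patterned way rather than merely at chance level.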
Related papers
- CARE: An Explainable Computational Framework for Assessing Client-Perceived Therapeutic Alliance Using Large Language Models [19.027335814014528]
We present CARE, an LLM-based framework to automatically predict multi-dimensional alliance scores and generate interpretable rationales from counseling transcripts. CARE is built on the CounselingWAI dataset and enriched with 9,516 expert-curated rationales. Experiments show that CARE outperforms leading LLMs and substantially reduces the gap between counselor evaluations and client-perceived alliance.
arXiv Detail & Related papers (2026-02-24T07:52:56Z) - JADE: Expert-Grounded Dynamic Evaluation for Open-Ended Professional Tasks [14.14645345504797]
We propose JADE, a two-layer evaluation framework for agentic AI. Layer 1 encodes expert knowledge as a predefined set of evaluation skills. Layer 2 performs report-specific, claim-level evaluation to flexibly assess diverse reasoning strategies.
arXiv Detail & Related papers (2026-02-06T08:26:09Z) - The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm [49.287792149338976]
We introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores. We thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability.
arXiv Detail & Related papers (2026-01-09T03:19:37Z) - PsychEthicsBench: Evaluating Large Language Models Against Australian Mental Health Ethics [35.52940216380734]
In mental health, clinically inadequate refusals can be perceived as unempathetic and discourage help-seeking. To address this gap, we move beyond refusal-centric metrics and introduce PsychEthicsBench, the first principle-grounded benchmark based on Australian psychology and psychiatry guidelines. Empirical results across 14 models show that refusal rates are poor indicators of ethical behavior, revealing a significant divergence between safety triggers and clinical appropriateness.
arXiv Detail & Related papers (2026-01-07T04:49:02Z) - SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models [60.8821834954637]
We present SafeRBench, the first benchmark that assesses LRM safety end-to-end. We pioneer the incorporation of risk categories and levels into input design. We introduce a micro-thought chunking mechanism to segment long reasoning traces into semantically coherent units.
arXiv Detail & Related papers (2025-11-19T06:46:33Z) - EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers [8.123835490773095]
Large Language Models for Simulating Professions (SP-LLMs) are pivotal for personalized education. EduGuardBench assesses professional fidelity using a Role-playing Fidelity Score (RFS). It also probes safety vulnerabilities using persona-based adversarial prompts targeting both general harms and, particularly, academic misconduct.
arXiv Detail & Related papers (2025-11-10T09:42:24Z) - Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge [28.534625907655776]
PsyCrisis-Bench is a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether model responses align with the safety principles defined by experts. We present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress.
arXiv Detail & Related papers (2025-08-11T17:52:07Z) - Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning [57.35282510032077]
Federated learning with secure aggregation enables private and collaborative learning from decentralised data without leaking sensitive client information. QI and FedGT were proposed for contribution evaluation (CE) and misbehaviour detection (MD), respectively. We combine the strengths of QI and FedGT to achieve both robust MD and accurate CE.
arXiv Detail & Related papers (2025-06-30T07:40:18Z) - Governance Challenges in Reinforcement Learning from Human Feedback: Evaluator Rationality and Reinforcement Stability [2.3961612657966946]
Reinforcement Learning from Human Feedback (RLHF) is central to aligning large language models with human values and expectations. This study examines how the cognitive capacity of evaluators, specifically their level of rationality, affects the stability of reinforcement signals.
arXiv Detail & Related papers (2025-04-17T19:10:00Z) - AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z) - CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants [5.7605009639020315]
We assess ten leading models across five scenarios (with 337 use cases each). We find that top-rated "harmless" models make recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising desires above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
arXiv Detail & Related papers (2024-10-28T15:59:31Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - The Pitfalls and Promise of Conformal Inference Under Adversarial Attacks [90.52808174102157]
In safety-critical applications such as medical imaging and autonomous driving, it is imperative to maintain high adversarial robustness to protect against potential adversarial attacks.
A notable knowledge gap remains concerning the uncertainty inherent in adversarially trained models.
This study investigates the uncertainty of deep learning models by examining the performance of conformal prediction (CP) in the context of standard adversarial attacks.
arXiv Detail & Related papers (2024-05-14T18:05:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.