CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
- URL: http://arxiv.org/abs/2510.23845v1
- Date: Mon, 27 Oct 2025 20:32:38 GMT
- Title: CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
- Authors: Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, Jinho D. Choi,
- Abstract summary: We introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement.
- Score: 8.296902072126182
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user--model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
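The abstract describes labeling the training corpus with a majority-vote ensemble of multiple language models and then fine-tuning on consensus versus unanimous subsets. A minimal sketch of that aggregation step is below; the label strings and the three-annotator setup are hypothetical, not taken from the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate per-model labels for one example.

    Returns the most common label and a flag for whether the vote
    was unanimous, which supports splitting the corpus into
    consensus-agreement and unanimous-agreement training subsets.
    """
    counts = Counter(labels)
    winner, freq = counts.most_common(1)[0]
    return winner, freq == len(labels)

# Hypothetical annotations from three models for a single transcript.
votes = ["suicidal_ideation", "suicidal_ideation", "no_crisis"]
label, unanimous = majority_vote(votes)
# Majority label wins; the example would join the consensus subset
# but not the unanimous one.
```

Examples where `unanimous` is true would feed the stricter training subset; consensus-only examples feed the larger, noisier one.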
Related papers
- Pressure Reveals Character: Behavioural Alignment Evaluation at Depth [3.634215320925722]
We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. We find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses.
arXiv Detail & Related papers (2026-02-24T11:52:17Z) - Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation. Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical procedures, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z) - JMedEthicBench: A Multi-Turn Conversational Benchmark for Evaluating Medical Safety in Japanese Large Language Models [47.20100799532625]
We introduce JMedEthicBench, the first multi-turn conversational benchmark for evaluating the medical safety of Large Language Models. Using a dual-LLM scoring protocol, we evaluate 27 models and find that commercial models maintain robust safety while medical-specialized models exhibit increased vulnerability.
arXiv Detail & Related papers (2026-01-04T18:18:18Z) - RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models [43.76961935990733]
The ability of grounded language models to selectively refuse to answer when the supplied context is flawed remains a significant failure point. We introduce RefusalBench, a generative methodology that creates diagnostic test cases through controlled linguistic perturbation. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path to improvement.
arXiv Detail & Related papers (2025-10-12T00:53:42Z) - Evaluating Large Language Models in Crisis Detection: A Real-World Benchmark from Psychological Support Hotlines [5.249698789320767]
PsyCrisisBench is a benchmark of 540 annotated transcripts from the Hangzhou Psychological Assistance Hotline. It assesses four tasks: mood status recognition, suicidal ideation detection, suicide plan identification, and risk assessment. Open-source models like QwQ-32B performed comparably to closed-source models on most tasks, though closed models retained an edge in mood detection.
arXiv Detail & Related papers (2025-06-02T05:18:24Z) - Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making [80.94208848596215]
We present a new concept called the Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement. Inspired by the "catfish effect" in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning.
arXiv Detail & Related papers (2025-05-27T17:59:50Z) - Benchmarking Unified Face Attack Detection via Hierarchical Prompt Tuning [58.16354555208417]
PAD and FFD are proposed to protect face data from physical media-based Presentation Attacks and digital editing-based DeepFakes, respectively. The lack of a Unified Face Attack Detection model to simultaneously handle attacks in these two categories is mainly attributed to two factors. We present a novel Visual-Language Model-based Hierarchical Prompt Tuning Framework that adaptively explores multiple classification criteria from different semantic spaces.
arXiv Detail & Related papers (2025-05-19T16:35:45Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis [9.738105623317601]
We introduce ANGST, a novel, first-of-its-kind benchmark for depression-anxiety comorbidity classification from social media posts.
We benchmark ANGST using various state-of-the-art language models, ranging from Mental-BERT to GPT-4.
While GPT-4 generally outperforms other models, none achieve an F1 score exceeding 72% in multi-class comorbid classification.
arXiv Detail & Related papers (2024-10-04T20:24:11Z) - PersonalizedUS: Interpretable Breast Cancer Risk Assessment with Local Coverage Uncertainty Quantification [2.6911061523689415]
The current "golden standard" relies on manual BI-RADS scoring by clinicians, often leading to unnecessary biopsies and a significant mental health burden on patients and their families.
We introduce PersonalizedUS, an interpretable machine learning system that leverages recent advances in conformal prediction to provide precise and personalized risk estimates.
Concrete clinical benefits include up to a 65% reduction in requested biopsies among BI-RADS 4a and 4b lesions, with minimal to no missed cancer cases.
arXiv Detail & Related papers (2024-08-28T00:47:55Z) - Detecting Suicide Risk in Online Counseling Services: A Study in a Low-Resource Language [5.2636083103718505]
We propose a model that combines pre-trained language models (PLMs) with a fixed, manually crafted (and clinically approved) set of suicidal cues.
Our model achieves 0.91 ROC-AUC and an F2-score of 0.55, significantly outperforming an array of strong baselines even early on in the conversation.
arXiv Detail & Related papers (2022-09-11T10:06:14Z) - SCRIB: Set-classifier with Class-specific Risk Bounds for Blackbox Models [48.374678491735665]
We introduce Set-classifier with Class-specific RIsk Bounds (SCRIB) to tackle this problem.
SCRIB constructs a set-classifier that controls the class-specific prediction risks with a theoretical guarantee.
We validated SCRIB on several medical applications, including sleep staging on electroencephalogram (EEG) data, X-ray COVID image classification, and atrial fibrillation detection based on electrocardiogram (ECG) data.
arXiv Detail & Related papers (2021-03-05T21:06:12Z)
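The SCRIB entry above describes a set-classifier that controls class-specific prediction risks. A minimal sketch of the general idea, not SCRIB's actual algorithm, is shown below: per-class probability thresholds are calibrated on held-out data so that roughly a (1 - alpha) fraction of each class's examples keep their true class in the prediction set. The data shapes and the quantile-based calibration rule are illustrative assumptions.

```python
import numpy as np

def calibrate_thresholds(probs, labels, n_classes, alpha=0.1):
    """Per-class threshold: the alpha-quantile of the true class's
    predicted probability on calibration data, so about (1 - alpha)
    of class-k examples keep class k in their prediction set."""
    thresholds = np.zeros(n_classes)
    for k in range(n_classes):
        scores = probs[labels == k, k]  # P(class k) on true-k examples
        thresholds[k] = np.quantile(scores, alpha)
    return thresholds

def predict_set(p, thresholds):
    """Prediction set: every class whose probability clears its
    own calibrated threshold (may contain several classes)."""
    return [k for k, t in enumerate(thresholds) if p[k] >= t]
```

Ambiguous inputs naturally yield larger sets, deferring the final call, which is the behaviour the set-classifier framing is meant to provide in risk-sensitive medical settings.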
This list is automatically generated from the titles and abstracts of the papers in this site.