Related papers: Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context

URL: http://arxiv.org/abs/2601.17642v1
Date: Sun, 25 Jan 2026 01:28:52 GMT
Title: Health-ORSC-Bench: A Benchmark for Measuring Over-Refusal and Safety Completion in Health Context
Authors: Zhihao Zhang, Liting Huang, Guanghao Wu, Preslav Nakov, Heng Ji, Usman Naseem,
Abstract summary: Health-ORSC-Bench is the first large-scale benchmark designed to measure textbfOver-Refusal and textbfSafe Completion quality in healthcare.<n>Our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity.<n>Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants.
Score: 82.32380418146656
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Safety alignment in Large Language Models is critical for healthcare; however, reliance on binary refusal boundaries often results in \emph{over-refusal} of benign queries or \emph{unsafe compliance} with harmful ones. While existing benchmarks measure these extremes, they fail to evaluate Safe Completion: the model's ability to maximise helpfulness on dual-use or borderline queries by providing safe, high-level guidance without crossing into actionable harm. We introduce \textbf{Health-ORSC-Bench}, the first large-scale benchmark designed to systematically measure \textbf{Over-Refusal} and \textbf{Safe Completion} quality in healthcare. Comprising 31,920 benign boundary prompts across seven health categories (e.g., self-harm, medical misinformation), our framework uses an automated pipeline with human validation to test models at varying levels of intent ambiguity. We evaluate 30 state-of-the-art LLMs, including GPT-5 and Claude-4, revealing a significant tension: safety-optimised models frequently refuse up to 80\% of "Hard" benign prompts, while domain-specific models often sacrifice safety for utility. Our findings demonstrate that model family and size significantly influence calibration: larger frontier models (e.g., GPT-5, Llama-4) exhibit "safety-pessimism" and higher over-refusal than smaller or MoE-based counterparts (e.g., Qwen-3-Next), highlighting that current LLMs struggle to balance refusal and compliance. Health-ORSC-Bench provides a rigorous standard for calibrating the next generation of medical AI assistants toward nuanced, safe, and helpful completions. The code and data will be released upon acceptance. \textcolor{red}{Warning: Some contents may include toxic or undesired contents.}

Related papers

Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering [6.782185804809171]
Large Language Models in Medical Question Answering severely hampered by ambiguous user queries.<n>In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input.<n>We introduce a novel AU-guided "Clarify-Before-Answer" framework, which incorporates AU-Probe - a lightweight module that detects input ambiguity directly from hidden states.
arXiv Detail & Related papers (2026-01-24T03:44:08Z)
SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data [7.015777723337828]
This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data.<n>We concluded health-data classification and grading rules in accordance with GB/T 39725-2020.
arXiv Detail & Related papers (2025-12-25T01:52:46Z)
GuardEval: A Multi-Perspective Benchmark for Evaluating Safety, Fairness, and Robustness in LLM Moderators [9.212268642636007]
We present GuardEval, a benchmark dataset for training and evaluation of large language models (LLMs)<n>We also present GemmaGuard (GGuard), a fine-tuned version of Gemma3-12B trained on GuardEval, to assess content moderation with fine-grained labels.<n>We show that multi-perspective, human-centered safety benchmarks are critical for reducing biased and inconsistent moderation decisions.
arXiv Detail & Related papers (2025-12-22T14:49:28Z)
DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models [59.45605332033458]
Safety mechanisms can backfire, causing over-refusal, where models decline benign requests out of excessive caution.<n>No existing benchmark has systematically addressed over-refusal in the visual modality.<n>This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content.
arXiv Detail & Related papers (2025-10-12T23:21:34Z)
MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine [69.08855631283829]
We introduce Med Omni-45 Degrees, a benchmark designed to quantify safety-performance trade-offs under manipulative hint conditions.<n>It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA.<n>Results show a consistent safety-performance trade-off, with no model surpassing the diagonal.
arXiv Detail & Related papers (2025-08-22T08:38:16Z)
Beyond Benchmarks: Dynamic, Automatic And Systematic Red-Teaming Agents For Trustworthy Medical Language Models [87.66870367661342]
Large language models (LLMs) are used in AI applications in healthcare.<n>Red-teaming framework that continuously stress-test LLMs can reveal significant weaknesses in four safety-critical domains.<n>A suite of adversarial agents is applied to autonomously mutate test cases, identify/evolve unsafe-triggering strategies, and evaluate responses.<n>Our framework delivers an evolvable, scalable, and reliable safeguard for the next generation of medical AI.
arXiv Detail & Related papers (2025-07-30T08:44:22Z)
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation [69.63626052852153]
We propose a proof-of-concept framework that projects how model-generated advice could propagate through societal systems.<n>We also introduce a dataset of 100 indirect harm scenarios, testing models' ability to foresee adverse, non-obvious outcomes from seemingly harmless user prompts.
arXiv Detail & Related papers (2025-06-26T02:28:58Z)
CARES: Comprehensive Evaluation of Safety and Adversarial Robustness in Medical LLMs [7.597770587484936]
We introduce CARES (Clinical Adversarial Robustness and Evaluation of Safety), a benchmark for evaluating medical large language models (LLMs) safety in healthcare.<n> CARES includes over 18,000 prompts spanning eight medical safety principles, four harm levels, and four prompting styles to simulate both malicious and benign use cases.<n>Our analysis reveals that many state-of-the-art LLMs remain vulnerable to jailbreaks that subtly rephrase harmful prompts, while also over-refusing safe but atypically phrased queries.
arXiv Detail & Related papers (2025-05-16T16:25:51Z)
OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs.<n>This study proposes a novel method for automatically generating large-scale over-refusal datasets.<n>We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z)
MBIAS: Mitigating Bias in Large Language Models While Retaining Context [2.321323878201932]
Large Language Models (LLMs) in diverse applications require an assurance of safety without compromising the contextual integrity of the generated content. We introduce MBIAS, an LLM framework that instruction fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce biases and toxic elements in LLM outputs while preserving the main information. Empirical analysis reveals that MBIAS achieves a reduction in bias and toxicity by over 30% in standard evaluations, and by more than 90% in diverse demographic tests.
arXiv Detail & Related papers (2024-05-18T13:31:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.