Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
- URL: http://arxiv.org/abs/2510.08158v1
- Date: Thu, 09 Oct 2025 12:38:16 GMT
- Title: Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs
- Authors: Shuzhou Yuan, Ercong Nie, Yinuo Sun, Chenxuan Zhao, William LaCroix, Michael Färber,
- Abstract summary: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries.<n>We introduce two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB)<n>Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios.
- Score: 10.896368527058714
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
Related papers
- Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing [0.0]
We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by reliability engineering.<n>APST repeatedly samples identical prompts under controlled operational conditions to surface latent failure modes.<n>We find that models with similar benchmark-aligned scores can exhibit substantially different empirical failure rates under repeated sampling.
arXiv Detail & Related papers (2026-02-12T10:09:13Z) - BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search [72.87861928940929]
Boundary-Aware Policy Optimization (BAPO) is a novel RL framework designed to cultivate reliable boundary awareness without compromising accuracy.<n>BAPO introduces two key components: (i) a group-based boundary-aware reward that encourages an IDK response only when the reasoning reaches its limit, and (ii) an adaptive reward modulator that strategically suspends this reward during early exploration, preventing the model from exploiting IDK as a shortcut.
arXiv Detail & Related papers (2026-01-16T07:06:58Z) - Risk-adaptive Activation Steering for Safe Multimodal Large Language Models [25.347491265330863]
One of the key challenges of modern AI models is ensuring they provide helpful responses to benign queries while refusing malicious ones.<n>We propose to reformulate queries to strengthen cross-modal attention to safety-critical image regions.<n>Using the assessed risk, it adaptively steers activations to generate responses that are safe and helpful without overhead from iterative output adjustments.
arXiv Detail & Related papers (2025-10-15T15:57:17Z) - D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models [62.83226685925107]
Deceptive Reasoning Exposure Suite (D-REX) is a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output.<n>Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought.<n>We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms.
arXiv Detail & Related papers (2025-09-22T15:59:40Z) - Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary [28.247658612894668]
RASS is an automated framework for prompt generation and selection that strategically targets overrefusal prompts near the safety boundary.<n> RASS efficiently identifies and curates boundary-aligned prompts, enabling more effective and targeted mitigation of overrefusal.
arXiv Detail & Related papers (2025-05-23T19:30:49Z) - FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning [12.467239356591238]
FalseReject is a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories.<n>We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts.<n>We show that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.
arXiv Detail & Related papers (2025-05-12T20:45:25Z) - SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z) - SG-Bench: Evaluating LLM Safety Generalization Across Diverse Tasks and Prompt Types [21.683010095703832]
We develop a novel benchmark to assess the generalization of large language model (LLM) safety across various tasks and prompt types.
This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety.
Our assessment reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment.
arXiv Detail & Related papers (2024-10-29T11:47:01Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.<n>Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.<n>We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.<n>DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? [70.77691645678804]
Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli.
This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies.
We identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive.
arXiv Detail & Related papers (2024-06-22T23:26:07Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.