The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
- URL: http://arxiv.org/abs/2510.07775v1
- Date: Thu, 09 Oct 2025 04:30:58 GMT
- Title: The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs
- Authors: Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana
- Abstract summary: Enhancing truthfulness can negatively impact safety alignment. In this paper, we show that increasing factual accuracy often comes at the cost of weakened refusal behavior. We propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders.
- Score: 9.470098715212087
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety alignment. We evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
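The abstract names two building blocks: sparse-autoencoder (SAE) feature identification and subspace orthogonalization during fine-tuning. As a rough illustration of how those pieces could fit together, here is a minimal, hypothetical sketch that projects fine-tuning gradients out of a refusal subspace spanned by SAE decoder directions. The names `sae_decoder`, `refusal_idx`, and `protected_layers` are placeholders, and the exact projection step is an assumption; this is not the authors' released implementation.

```python
# Hypothetical sketch: project fine-tuning updates out of a refusal subspace
# derived from sparse-autoencoder (SAE) decoder directions.
# Assumptions: `sae_decoder` is the SAE decoder weight over residual-stream
# activations (shape: d_model x n_features), and `refusal_idx` lists feature
# columns previously identified as refusal-related.
import torch

def refusal_basis(sae_decoder: torch.Tensor, refusal_idx: list) -> torch.Tensor:
    """Return an orthonormal basis (d_model x k) for the refusal-related directions."""
    directions = sae_decoder[:, refusal_idx]      # d_model x k
    q, _ = torch.linalg.qr(directions)            # orthonormalize the columns
    return q

def orthogonalize_update(delta_w: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Remove the component of a weight update that lies in the refusal subspace.

    delta_w: (d_model x d_in) gradient or weight delta; basis: (d_model x k) orthonormal.
    """
    inside = basis @ (basis.T @ delta_w)          # component inside the refusal subspace
    return delta_w - inside                       # orthogonal remainder

# Illustrative use inside a fine-tuning step (names are placeholders):
# basis = refusal_basis(sae_decoder, refusal_idx)
# for name, p in model.named_parameters():
#     if p.grad is not None and name in protected_layers:
#         p.grad.copy_(orthogonalize_update(p.grad, basis))
```

The intuition is that if refusal behavior lives approximately in the span of a few decoder directions, removing that component from each update leaves the refusal circuitry intact while the remaining gradient can still improve factuality.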
Related papers
- When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents [90.05202259420138]
Computer-use agents (CUAs) can deviate from expected outcomes even under benign input contexts. We introduce the first conceptual and methodological framework for unintended CUA behaviors. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback.
arXiv Detail & Related papers (2026-02-09T03:20:11Z)
- Detecting Misbehaviors of Large Vision-Language Models by Evidential Uncertainty Quantification [27.02252748004729]
Large vision-language models (LVLMs) have shown substantial advances in multimodal understanding and generation. However, they frequently produce unreliable or even harmful content, such as fact hallucinations or dangerous instructions. Evidential Uncertainty Quantification (EUQ) captures both information conflict and ignorance for effective detection of LVLM misbehaviors.
arXiv Detail & Related papers (2026-02-05T10:51:39Z)
- ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models [17.130698952440316]
We argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space. We propose ARREST, a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections.
arXiv Detail & Related papers (2026-01-07T21:04:37Z)
- Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed the refusal cliff. We propose Cliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
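For intuition only, a minimal sketch of per-position linear probing for refusal intention; the data layout (`hidden`, `labels`) and the use of scikit-learn logistic regression are assumptions for illustration, not the paper's exact probing setup.

```python
# Hypothetical sketch: per-token-position linear probes for refusal intention.
# Assumptions: `hidden` is a (n_examples, n_positions, d_model) array of hidden
# states from a reasoning model, and `labels` marks prompts that should be refused.
import numpy as np
from sklearn.linear_model import LogisticRegression

def refusal_probe_curve(hidden: np.ndarray, labels: np.ndarray) -> list:
    """Fit one logistic probe per position and return the mean predicted refusal
    probability on harmful prompts; a sharp late drop would suggest a 'refusal cliff'."""
    curve = []
    for t in range(hidden.shape[1]):
        probe = LogisticRegression(max_iter=1000)
        probe.fit(hidden[:, t, :], labels)                       # probe at position t
        p_refuse = probe.predict_proba(hidden[labels == 1, t, :])[:, 1]
        curve.append(float(p_refuse.mean()))
    return curve
```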
arXiv Detail & Related papers (2025-10-07T15:32:59Z)
- Unsupervised Hallucination Detection by Inspecting Reasoning Processes [53.15199932086543]
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. We propose IRIS, an unsupervised hallucination detection framework that leverages internal representations intrinsic to factual correctness. Our approach is fully unsupervised, has low computational cost, and works well even with little training data, making it suitable for real-time detection.
arXiv Detail & Related papers (2025-09-12T06:58:17Z)
- Rethinking Safety in LLM Fine-tuning: An Optimization Perspective [56.31306558218838]
We show that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. We propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance. Our experiments on the Llama families across multiple datasets demonstrate that safety problems can largely be avoided without specialized interventions.
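A minimal sketch of the kind of parameter-space EMA momentum described above, assuming a standard PyTorch fine-tuning loop; the class name and decay value are illustrative rather than the paper's exact recipe.

```python
# Hypothetical sketch: exponential moving average (EMA) of parameters during
# fine-tuning, a parameter-space momentum intended to keep the fine-tuned model
# close to the safety-aligned starting point.
import torch

class ParameterEMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        self.shadow = {n: p.detach().clone() for n, p in model.named_parameters()}

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # Blend current parameters into the shadow copy after each optimizer step.
        for n, p in model.named_parameters():
            self.shadow[n].mul_(self.decay).add_(p.detach(), alpha=1.0 - self.decay)

    @torch.no_grad()
    def copy_to(self, model: torch.nn.Module):
        # Load the smoothed weights, e.g. before evaluation or deployment.
        for n, p in model.named_parameters():
            p.copy_(self.shadow[n])
```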
arXiv Detail & Related papers (2025-08-17T23:46:36Z)
- ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs [50.18087419133284]
Existing hallucination detection methods leveraging hidden states predominantly focus on static and isolated representations. We introduce a novel metric, the ICR Score, which quantifies the contribution of modules to the hidden states' update. We propose a hallucination detection method, the ICR Probe, which captures the cross-layer evolution of hidden states.
arXiv Detail & Related papers (2025-07-22T11:44:26Z)
- Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency [17.57889200051214]
Fine-tuning a general-purpose large language model (LLM) for a specific domain or task has become a routine procedure for ordinary users. We consider the resulting loss of safety to be a critical failure mode of LLMs, given the widespread uptake of fine-tuning combined with the benign nature of the "attack". Our experiments expose surprising variance in the results of the safety evaluation, even when seemingly inconsequential changes are made to the fine-tuning setup.
arXiv Detail & Related papers (2025-06-20T17:57:12Z)
- Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
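As a rough illustration, here is a sketch of injecting bounded random perturbations into one hidden layer's output via a PyTorch forward hook; the chosen layer, noise scale, and hook-based mechanism are assumptions about how such latent perturbations might be applied, not the authors' LAPT implementation.

```python
# Hypothetical sketch: add a bounded random perturbation to a chosen hidden
# layer's output during training, as one way to stress-test latent robustness.
import torch

def add_latent_noise_hook(layer: torch.nn.Module, epsilon: float = 0.05):
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noise = torch.randn_like(hidden)
        # Scale noise to a fixed norm budget per hidden vector.
        noise = epsilon * noise / noise.norm(dim=-1, keepdim=True).clamp_min(1e-8)
        perturbed = hidden + noise
        if isinstance(output, tuple):
            return (perturbed,) + output[1:]
        return perturbed
    return layer.register_forward_hook(hook)

# Illustrative use (layer index is a placeholder):
# handle = add_latent_noise_hook(model.model.layers[10])
# ... run fine-tuning steps ...
# handle.remove()
```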
arXiv Detail & Related papers (2025-06-19T07:03:05Z)
- Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models. If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and out-of-distribution jailbreaks. We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z)
- A Knowledge-guided Adversarial Defense for Resisting Malicious Visual Manipulation [93.28532038721816]
Malicious applications of visual manipulation have raised serious threats to the security and reputation of users in many fields. We propose a knowledge-guided adversarial defense (KGAD) to actively force malicious manipulation models to output semantically confusing samples.
arXiv Detail & Related papers (2025-04-11T10:18:13Z)
- On Minimizing Adversarial Counterfactual Error in Adversarial RL [18.044879441434432]
Adversarial noise poses significant risks in safety-critical scenarios. We introduce a novel objective called Adversarial Counterfactual Error (ACoE). Our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges.
arXiv Detail & Related papers (2024-06-07T08:14:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.