When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
- URL: http://arxiv.org/abs/2509.00544v3
- Date: Mon, 13 Oct 2025 10:53:43 GMT
- Title: When Thinking Backfires: Mechanistic Insights Into Reasoning-Induced Misalignment
- Authors: Hanqi Yan, Hainiu Xu, Siya Qi, Shu Yang, Yulan He
- Abstract summary: Reasoning-Induced Misalignment (RIM) emerges as reasoning capabilities strengthen. RIM arises when specific types of reasoning patterns are introduced during inference or training. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons.
- Score: 23.096167213579957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges as reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we discover that specific attention heads facilitate refusal by reducing their attention to CoT tokens, a mechanism that modulates the model's rationalization process during inference. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning on the identified reasoning patterns. This entanglement strongly correlates with catastrophic forgetting, providing a neuron-level explanation for RIM.
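To make the first mechanism concrete, the sketch below measures, per attention head, how much attention the final answer-side token places on the CoT span. It is a minimal illustration of this style of representation analysis, not the paper's released code: the choice of GPT-2, the prompt strings, and the span bookkeeping are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model and prompts; the paper's experimental setup may differ.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Q: Is it safe to share my bank password with a stranger?\n"
cot = "Let's think step by step. Sharing credentials exposes the account to theft.\n"
answer_prefix = "A:"

ids_q = tok(question, return_tensors="pt").input_ids
ids_qc = tok(question + cot, return_tensors="pt").input_ids
ids_all = tok(question + cot + answer_prefix, return_tensors="pt").input_ids

# Token positions covered by the CoT (approximate: GPT-2 adds no special tokens,
# so the concatenation offsets roughly line up).
cot_start, cot_end = ids_q.shape[1], ids_qc.shape[1]

with torch.no_grad():
    out = model(ids_all, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer.
# For the last token, sum the attention mass that falls on the CoT span, per head.
for layer, attn in enumerate(out.attentions):
    mass = attn[0, :, -1, cot_start:cot_end].sum(dim=-1)  # shape: (n_heads,)
    low = torch.argsort(mass)[:3]  # heads attending least to the CoT
    print(f"layer {layer:2d}: lowest CoT-attention heads {low.tolist()}")
```

Heads whose CoT mass drops sharply on refusal-eliciting prompts would be candidates for the refusal-facilitating heads described above; the second, training-time finding (activation entanglement in safety-critical neurons) would need paired activation statistics rather than attention maps.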
Related papers
- Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models [66.36240676392502]
Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. Recent studies reveal a sharp performance drop in reasoning hop generalization scenarios. We propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates the responsible attention heads during the reasoning process.
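As a rough sketch of what deactivating heads at test time can look like in practice (this is a generic head-ablation recipe for GPT-2, not the paper's procedure; the layer and head indices are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

heads_to_ablate = {3: [1, 7], 5: [0]}  # hypothetical {layer: [head indices]}
head_dim = model.config.n_embd // model.config.n_head

def make_pre_hook(head_indices):
    # Zero the chosen heads' slices of the concatenated head outputs
    # just before GPT-2's output projection (c_proj) mixes them.
    def pre_hook(module, args):
        hidden = args[0].clone()
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]
    return pre_hook

handles = [
    model.transformer.h[layer].attn.c_proj.register_forward_pre_hook(make_pre_hook(heads))
    for layer, heads in heads_to_ablate.items()
]

ids = tok("Q: If every hop doubles the count, what is it after 3 hops of 2?\n"
          "A: Let's think step by step.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40,
                                pad_token_id=tok.eos_token_id)[0]))

for h in handles:
    h.remove()  # restore the unmodified model
```

A dynamic version would first score heads on the current input (for example, from their attention patterns over the reasoning chain) before choosing which entries to put in `heads_to_ablate`.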
arXiv Detail & Related papers (2026-01-29T03:24:32Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models [15.797612515648412]
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. Recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. We introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning.
arXiv Detail & Related papers (2025-09-29T01:13:33Z) - SafeConstellations: Steering LLM Safety to Reduce Over-Refusals Through Task-Specific Trajectory [5.962636335604981]
Over-refusal behavior causes models to reject benign instructions that superficially resemble harmful content. We introduce SafeConstellations, an inference-time trajectory-shifting approach that tracks task-specific trajectory patterns and guides representations toward non-refusal pathways. Our method reduces over-refusal rates by up to 73% with minimal impact on utility, offering a principled approach to mitigating over-refusals.
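SafeConstellations' trajectory-tracking procedure is not spelled out in the summary; the sketch below is only a generic activation-steering baseline in the same spirit: estimate a direction separating answered from over-refused prompts and nudge a middle layer's hidden states along it at inference. The model, layer, scale, and example prompts are all illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, ALPHA = 6, 2.0  # hypothetical layer and steering strength

def mean_last_hidden(prompts, layer):
    # Mean last-token hidden state at `layer` over a list of prompts.
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states[layer]
        vecs.append(hs[0, -1])
    return torch.stack(vecs).mean(dim=0)

answered = ["How do I bake bread at home?", "Explain photosynthesis simply."]
over_refused = ["How do I kill a stuck Python process?", "How can I shoot a great photo?"]
steer = mean_last_hidden(answered, LAYER) - mean_last_hidden(over_refused, LAYER)
steer = steer / steer.norm()

def shift_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; shift the hidden states toward the answered cluster.
    return (output[0] + ALPHA * steer,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(shift_hook)
ids = tok("How do I terminate a frozen process on Linux?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()
```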
arXiv Detail & Related papers (2025-08-15T07:54:42Z) - Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling. We identify an important security risk, intuitively motivated and empirically verified, in the form of an inverse scaling law. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z) - Understanding Matching Mechanisms in Cross-Encoders [11.192264101562786]
Cross-encoders are highly effective models whose internal mechanisms are mostly unknown. Most works trying to explain their behavior focus on high-level processes. We demonstrate that more straightforward methods can already provide valuable insights.
arXiv Detail & Related papers (2025-07-19T13:05:27Z) - Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models [0.0]
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks. While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear. We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
arXiv Detail & Related papers (2025-07-03T17:01:53Z) - On Reasoning Strength Planning in Large Reasoning Models [50.61816666920207]
We find evidence that LRMs pre-plan their reasoning strength in their activations even before generation. We then uncover that LRMs encode this reasoning strength through a pre-allocated directional vector embedded in the activations of the model. Our work provides new insights into the internal mechanisms of reasoning in LRMs and offers practical tools for controlling their reasoning behaviors.
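A toy sketch of how such a pre-allocated direction could be probed: fit a linear direction from pre-generation activations to the eventual reasoning length, then read out a new prompt by projection. The arrays below are random stand-ins, not measurements from any model.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 200
true_direction = rng.normal(size=d)   # stand-in for the encoded direction
acts = rng.normal(size=(n, d))        # pre-generation activations, one row per prompt
cot_len = acts @ true_direction + rng.normal(scale=0.5, size=n)  # reasoning tokens produced

# Fit a linear probe: the direction along which activations predict reasoning strength.
w, *_ = np.linalg.lstsq(acts, cot_len, rcond=None)

# Read out the planned reasoning strength of a new activation by projection.
new_act = rng.normal(size=d)
print("predicted reasoning strength:", float(new_act @ w))
print("cosine with the generating direction:",
      float(w @ true_direction / (np.linalg.norm(w) * np.linalg.norm(true_direction))))
```

Scaling activations along such a direction at inference time is one natural control knob in line with the abstract's claim about controlling reasoning behavior.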
arXiv Detail & Related papers (2025-06-10T02:55:13Z) - Let LRMs Break Free from Overthinking via Self-Braking Tuning [68.93713497579853]
Large reasoning models (LRMs) have significantly enhanced their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process. We propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
arXiv Detail & Related papers (2025-05-20T16:53:40Z) - Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents. Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses. This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models [7.463303856292452]
Low perceived switching frictions can lead to model choices that overlook more subtle changes in behavior. Our experiments use a popular game-theoretic behavioral economics model of trust to show stark differences in the trusting behavior of OpenAI's and DeepSeek's models.
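For readers unfamiliar with the setup, the standard trust (investment) game is small enough to write down directly; the endowment and multiplier below are the conventional defaults, not necessarily the exact parameters used in the paper's experiments.

```python
def trust_game(endowment: float, sent: float, return_fraction: float,
               multiplier: float = 3.0) -> tuple[float, float]:
    """One round of the trust game: the investor sends part of an endowment,
    it is multiplied on the way to the trustee, and the trustee returns a
    fraction of what was received."""
    assert 0.0 <= sent <= endowment and 0.0 <= return_fraction <= 1.0
    received = multiplier * sent
    returned = return_fraction * received
    investor_payoff = endowment - sent + returned
    trustee_payoff = received - returned
    return investor_payoff, trustee_payoff

# e.g. an LLM playing the investor sends 60% of a 10-unit endowment,
# and the trustee returns one third of what it received:
print(trust_game(endowment=10.0, sent=6.0, return_fraction=1 / 3))  # (10.0, 12.0)
```

The amount an LLM chooses to send, and how that choice shifts when reasoning is enabled, is the kind of behavioral signal such an experiment compares across models.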
arXiv Detail & Related papers (2025-02-18T12:46:18Z) - Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning [39.48925539103229]
Fine-tuning large language models (LLMs) can inadvertently degrade their safety alignment. This phenomenon makes models more susceptible to providing inappropriate responses. Our work highlights the complexities of maintaining safety alignment during fine-tuning.
arXiv Detail & Related papers (2025-02-03T07:09:09Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the assumption that base LLMs are difficult to misuse.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system.
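The summary does not spell out how the Adversarial Rate is computed, and the paper frames it through formal verification (see above); as a rough sampling-based reading for illustration only, one could estimate the fraction of bounded perturbations that change the action a policy selects:

```python
import numpy as np

def adversarial_rate(policy, states, epsilon=0.05, n_samples=100, seed=0):
    """Estimate the fraction of (state, perturbation) pairs for which a bounded
    perturbation flips the policy's chosen action. Illustrative only; the paper's
    verification-based definition may differ."""
    rng = np.random.default_rng(seed)
    flips = total = 0
    for s in states:
        base_action = policy(s)
        for _ in range(n_samples):
            delta = rng.uniform(-epsilon, epsilon, size=s.shape)
            flips += int(policy(s + delta) != base_action)
            total += 1
    return flips / total

# Toy DRL-style policy: pick the index of the largest state component.
toy_policy = lambda s: int(np.argmax(s))
states = [np.array([0.51, 0.49]), np.array([0.9, 0.1])]
print(adversarial_rate(toy_policy, states, epsilon=0.05))
```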
arXiv Detail & Related papers (2024-02-07T21:58:40Z) - Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis [127.85293480405082]
The rapid development of large language models (LLMs) has not only provided numerous opportunities but also presented significant challenges.
Existing alignment methods usually direct LLMs toward favorable outcomes by utilizing human-annotated, flawless instruction-response pairs.
This study proposes a novel alignment technique based on mistake analysis, which deliberately exposes LLMs to erroneous content to learn the reasons for mistakes and how to avoid them.
arXiv Detail & Related papers (2023-10-16T14:59:10Z) - Learning Causally Disentangled Representations via the Principle of Independent Causal Mechanisms [17.074858228123706]
We propose a framework for learning causally disentangled representations supervised by causally related observed labels.
We show that our framework induces highly disentangled causal factors, improves interventional robustness, and is compatible with counterfactual generation.
arXiv Detail & Related papers (2023-06-02T00:28:48Z) - ACRE: Abstract Causal REasoning Beyond Covariation [90.99059920286484]
We introduce the Abstract Causal REasoning dataset for systematic evaluation of current vision systems in causal induction.
Motivated by the stream of research on causal discovery in Blicket experiments, we query a visual reasoning system with the following four types of questions in either an independent scenario or an interventional scenario.
We notice that pure neural models tend towards an associative strategy under their chance-level performance, whereas neuro-symbolic combinations struggle in backward-blocking reasoning.
arXiv Detail & Related papers (2021-03-26T02:42:38Z)