Fail-Closed Alignment for Large Language Models
- URL: http://arxiv.org/abs/2602.16977v1
- Date: Thu, 19 Feb 2026 00:33:35 GMT
- Title: Fail-Closed Alignment for Large Language Models
- Authors: Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong,
- Abstract summary: We propose fail-closed alignment as a design principle for robust large language model safety. We present a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously.
- Score: 4.205036273334146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature (via prompt-based jailbreaks) can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
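The abstract's core loop (identify a refusal direction, ablate it, then retrain so refusal re-forms along a new subspace) is easiest to picture with the difference-of-means construction used in prior refusal-direction work. The sketch below is not the authors' implementation: the `refusal_direction` and `ablate_direction` helpers, the single-layer activation tensors, and the toy data are all illustrative assumptions, and only the identify-and-ablate step of one progressive round is shown.

```python
# Hypothetical sketch of one "identify and ablate" round, in the spirit of the
# progressive alignment framework described in the abstract. Assumptions (not
# from the paper): activations are residual-stream vectors at a single layer,
# already collected for harmful and harmless prompts.
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means candidate refusal direction (unit norm).

    harmful_acts / harmless_acts: [num_prompts, d_model] activations
    collected on harmful vs. harmless prompts at one layer/position.
    """
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project activations onto the subspace orthogonal to `direction`.

    Removing the component along a previously learned refusal direction forces
    subsequent fine-tuning to encode refusal somewhere else in activation space.
    """
    coeffs = acts @ direction                       # [n] projection coefficients
    return acts - coeffs.unsqueeze(-1) * direction  # [n, d_model]

# One progressive round (schematic):
#   1. r_k = refusal_direction(...) computed on the current model
#   2. ablate r_k (e.g., via an activation hook or by orthogonalizing weights)
#   3. fine-tune on safety data so refusal is rebuilt in a new subspace
#   4. repeat, keeping the list [r_1, ..., r_k] of previously ablated directions

if __name__ == "__main__":
    torch.manual_seed(0)
    d_model = 16
    harmful = torch.randn(8, d_model) + 2.0   # toy stand-in for real activations
    harmless = torch.randn(8, d_model)
    r1 = refusal_direction(harmful, harmless)
    ablated = ablate_direction(harmful, r1)
    # After ablation, the component along r1 is numerically zero.
    print((ablated @ r1).abs().max())
```

In practice, ablation of this kind could be applied as a forward hook on the residual stream or by orthogonalizing weight matrices against each stored direction; the abstract does not specify which layers, positions, or training objective the paper's framework actually uses.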
Related papers
- Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment [13.463606100715504]
Large language models are vulnerable to attacks that disguise harmful intent. This vulnerability stems from shallow alignment mechanisms that lack deep reasoning. We propose enhancing alignment through reasoning-aware post-training.
arXiv Detail & Related papers (2026-02-24T20:30:51Z)
- A Fragile Guardrail: Diffusion LLM's Safety Blessing and Its Failure Mode [51.43498132808724]
We show that Diffusion large language models (D-LLMs) have intrinsic robustness against jailbreak attacks. We identify a simple yet effective failure mode, termed context nesting, where harmful requests are embedded within structured benign contexts. We show that this simple strategy is sufficient to bypass D-LLMs' safety blessing, achieving state-of-the-art attack success rates.
arXiv Detail & Related papers (2026-01-30T23:08:14Z)
- Unvalidated Trust: Cross-Stage Vulnerabilities in Large Language Model Architectures [0.0]
This paper presents a mechanism-centered taxonomy of 41 recurring risk patterns in commercial language models. We argue that these behaviors constitute architectural failure modes and that string-level filtering alone is insufficient.
arXiv Detail & Related papers (2025-10-30T09:38:45Z)
- ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack [22.48980625853356]
Large language models (LLMs) exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability.
arXiv Detail & Related papers (2025-09-30T06:33:52Z)
- AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
- Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction [21.03567306455414]
Jailbreak attacks pose persistent threats to large language models (LLMs). We introduce DeepRefusal, a robust safety alignment framework that overcomes these issues. Our method reduces attack success rates by approximately 95%, while maintaining model capabilities with minimal performance degradation.
arXiv Detail & Related papers (2025-09-18T17:54:31Z)
- Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs [83.11815479874447]
We propose a novel jailbreak attack framework, inspired by cognitive decomposition and biases in human cognition. We employ cognitive decomposition to reduce the complexity of malicious prompts and relevance bias to reorganize prompts. We also introduce a ranking-based harmfulness evaluation metric that surpasses the traditional binary success-or-failure paradigm.
arXiv Detail & Related papers (2025-05-03T05:28:11Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request. We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions. We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
arXiv Detail & Related papers (2025-02-24T18:52:59Z)
- Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.