The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
- URL: http://arxiv.org/abs/2602.15799v1
- Date: Tue, 17 Feb 2026 18:39:15 GMT
- Title: The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
- Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova,
- Abstract summary: Fine-tuning language models on benign tasks unpredictably degrades safety guardrails. We prove that alignment concentrates in low-dimensional subspaces with sharp curvature. We formalize this mechanism through the Alignment Instability Condition.
- Score: 40.556122962771276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this tension through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of the alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
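The quartic scaling law admits a compact toy illustration. The sketch below is a hypothetical two-parameter model, not the paper's construction, that wires together the three ingredients the abstract names: sharp curvature along a safety-critical direction, a fine-tuning gradient that starts out exactly orthogonal to that direction, and curvature coupling between the task and safety parameters. All parameter values (`lam_safety`, `couple`, `g_task`, `eta`) are illustrative assumptions.

```python
import numpy as np

# Toy 2-D model (hypothetical, for illustration only; not the paper's setup).
# theta = (x, y): x is a benign task direction, y a safety-critical direction.
lam_safety = 50.0   # sharp curvature of the alignment loss along y
couple     = 0.5    # curvature coupling between task and safety directions
g_task     = 1.0    # benign task gradient, initially orthogonal to y
eta        = 1e-2   # learning rate

def safety_loss(theta):
    # Alignment loss concentrated in a sharp 1-D subspace (the y-axis).
    return 0.5 * lam_safety * theta[1] ** 2

def finetune_grad(theta):
    x, y = theta
    # Gradient of L_ft = -g_task * x + couple * x * y.
    # The cross term is the curvature coupling; at theta = 0 the gradient
    # (-g_task, 0) is exactly orthogonal to the safety direction.
    return np.array([couple * y - g_task, couple * x])

for t in [10, 50, 100, 200]:
    theta = np.zeros(2)
    for _ in range(t):
        theta -= eta * finetune_grad(theta)
    print(f"t={t:4d}  safety loss ~ {safety_loss(theta):.3e}")
# Early on the safety loss grows roughly as t^4: the x-displacement grows
# ~t, the coupling feeds it into y ~ t^2, and the sharp quadratic loss in
# y squares that to ~t^4, mirroring the paper's quartic scaling law.
```

Running the loop shows the loss ratios tracking (t2/t1)^4 in the early regime, which is the qualitative behavior the scaling law describes.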
Related papers
- When Backdoors Go Beyond Triggers: Semantic Drift in Diffusion Models Under Encoder Attacks [2.4923006485141284]
We demonstrate that encoder-side poisoning induces persistent, trigger-free semantic corruption. Backdoors act as low-rank, target-centered deformations that amplify local sensitivity, causing distortion to propagate coherently across semantic neighborhoods. Our findings, validated across diffusion and contrastive paradigms, expose the deep structural risks of encoder poisoning and highlight the necessity of geometric audits beyond simple attack-success rates.
arXiv Detail & Related papers (2026-02-21T23:48:04Z) - Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking [0.0]
Grokking -- the delayed transition from memorization to generalization in small tasks -- remains poorly understood. PCA of attention-weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace. We find that curvature grows sharply in directions transverse to the execution subspace, while the trajectory remains largely confined to it.
arXiv Detail & Related papers (2026-02-18T03:57:56Z) - Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection [52.551864761088574]
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility. We argue that this tax primarily arises from continual-learning-style forgetting during sequential alignment. We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA) to balance plasticity and stability (a sketch of the underlying projection idea appears after this list).
arXiv Detail & Related papers (2026-02-08T09:53:46Z) - Riemannian Flow Matching for Disentangled Graph Domain Adaptation [51.98961391065951]
Graph Domain Adaptation (GDA) typically uses adversarial learning to align graph embeddings in Euclidean space. DisRFM is a geometry-aware GDA framework that unifies embedding and flow-based transport.
arXiv Detail & Related papers (2026-01-31T11:05:35Z) - Geometric and Dynamic Scaling in Deep Transformers [13.697614668609205]
We argue that the collapse of deep Transformers is fundamentally a geometric problem. We propose a unified geometric framework that addresses these failures through two principles. Our analysis predicts that enforcing geometric validity while allowing dynamic erasure is essential for avoiding rank collapse in ultra-deep networks.
arXiv Detail & Related papers (2026-01-03T00:41:46Z) - Geometric-Disentanglement Unlearning [106.99160454669902]
Gradient ascent on forget samples often harms retained knowledge. We propose Geometric-Disentanglement Unlearning (GU), which decomposes any candidate forget-gradient update into tangential and normal components with respect to the retain space and executes only the normal component (the projection sketch after this list applies here as well). Our method is plug-and-play and can be attached to existing gradient-based unlearning procedures to mitigate side effects.
arXiv Detail & Related papers (2025-11-21T09:58:25Z) - Geometry-Aware Backdoor Attacks: Leveraging Curvature in Hyperbolic Embeddings [3.8806403512213787]
Non-Euclidean foundation models place representations in curved spaces such as hyperbolic geometry. Small input changes appear subtle to standard input-space detectors but produce disproportionately large shifts in the model's representation space. We propose a geometry-adaptive trigger and evaluate it across tasks and architectures.
arXiv Detail & Related papers (2025-10-07T19:24:43Z) - Adaptive Dual Uncertainty Optimization: Boosting Monocular 3D Object Detection under Test-Time Shifts [80.32933059529135]
Test-Time Adaptation (TTA) methods have emerged to adapt to target distributions during inference. We propose Dual Uncertainty Optimization (DUO), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues.
arXiv Detail & Related papers (2025-08-28T07:09:21Z) - Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence. We observe that minor latent shifts can still trigger unsafe responses in aligned models. We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training (a latent-perturbation sketch appears after this list).
arXiv Detail & Related papers (2025-06-19T07:03:05Z) - Why LLM Safety Guardrails Collapse After Fine-tuning: A Similarity Analysis Between Alignment and Fine-tuning Datasets [64.96967819446553]
This paper investigates the degradation of safety guardrails through the lens of representation similarity between upstream alignment datasets and downstream fine-tuning tasks. High similarity between these datasets significantly weakens safety guardrails, making models more susceptible to jailbreaks. Low similarity between the two yields substantially more robust models, reducing the harmfulness score by up to 10.33%.
arXiv Detail & Related papers (2025-06-05T17:59:55Z)
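Two of the entries above, OGPSA and GU, rest on the same first-order construction: project a candidate gradient update onto the orthogonal complement of a protected subspace spanned by safety (or retain) gradients. The sketch below shows that generic projection in PyTorch; the function name, shapes, and QR-based orthonormalization are illustrative choices, not either paper's exact recipe.

```python
import torch

def project_orthogonal(task_grad: torch.Tensor,
                       safety_grads: torch.Tensor) -> torch.Tensor:
    """Remove from a task gradient its components along a set of
    safety-gradient directions (a generic sketch of orthogonal gradient
    projection in the spirit of OGPSA/GU, not either paper's recipe).

    task_grad:    flattened gradient of the new-task loss, shape (d,)
    safety_grads: rows spanning the protected subspace, shape (k, d)
    """
    # Orthonormalize the protected directions (columns of Q span them).
    Q, _ = torch.linalg.qr(safety_grads.T)        # Q has shape (d, k)
    # Subtract the projection onto span(Q): g <- g - Q Q^T g.
    return task_grad - Q @ (Q.T @ task_grad)

# Usage sketch: before each optimizer step, replace the raw gradient with
# its projection so the update is (initially) orthogonal to safety directions.
d, k = 1000, 8
g_task = torch.randn(d)
G_safe = torch.randn(k, d)
g_proj = project_orthogonal(g_task, G_safe)
print(torch.allclose(G_safe @ g_proj, torch.zeros(k), atol=1e-4))  # ~ True
```

Note that the main paper above argues such a projection protects only the initial snapshot: curvature coupling can steer later updates back into the protected subspace, which is exactly the instability its quartic scaling law quantifies.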
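The LAPT entry describes fine-tuning with controlled perturbations injected into hidden representations. Below is a minimal latent-perturbation harness built on a PyTorch forward hook. It injects norm-bounded random noise, whereas LAPT's perturbations are adversarially crafted; the class name, epsilon, and toy MLP are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentPerturbation:
    """Inject a bounded random perturbation into a layer's hidden states
    during training (a generic sketch of latent-perturbation fine-tuning;
    LAPT crafts adversarial perturbations, which this sketch replaces
    with scaled random noise for simplicity)."""

    def __init__(self, module: nn.Module, epsilon: float = 0.05):
        self.epsilon = epsilon
        self.handle = module.register_forward_hook(self._perturb)

    def _perturb(self, module, inputs, output):
        if not module.training:
            return output                 # clean hidden states at eval time
        noise = torch.randn_like(output)
        # Scale the noise to a fixed fraction of the hidden-state norm.
        noise = self.epsilon * noise * output.norm() / (noise.norm() + 1e-8)
        return output + noise

    def remove(self):
        self.handle.remove()

# Usage sketch on a toy MLP: perturb the first hidden layer while training.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
hook = LatentPerturbation(model[0], epsilon=0.05)
x = torch.randn(8, 16)
loss = model(x).square().mean()   # stand-in for a real training loss
loss.backward()
hook.remove()
```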