Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
- URL: http://arxiv.org/abs/2505.14185v1
- Date: Tue, 20 May 2025 10:41:49 GMT
- Title: Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
- Authors: Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma,
- Abstract summary: We study whether safety-relevant behavior is concentrated in specific subspaces. We find no evidence of a subspace that selectively governs safety. This suggests that subspace-based defenses may face fundamental limitations.
- Score: 4.724646466332421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
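The abstract describes comparing safe and unsafe fine-tuning in parameter space. As a hedged illustration (not the authors' released code; the use of SVD-based principal angles here is an assumption), one way to test whether two fine-tuning runs update overlapping weight subspaces is to extract the top singular directions of each run's weight delta and measure the overlap between the resulting subspaces:

```python
# Minimal sketch: overlap between the update subspaces of two fine-tuning runs.
import torch

def top_left_singular_vectors(delta: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k left singular vectors of a weight update matrix."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return U[:, :k]

def subspace_overlap(delta_a: torch.Tensor, delta_b: torch.Tensor, k: int = 16) -> float:
    """Average squared cosine of the principal angles between two update subspaces.
    1.0 means identical subspaces, 0.0 means orthogonal ones."""
    Ua = top_left_singular_vectors(delta_a, k)
    Ub = top_left_singular_vectors(delta_b, k)
    # Singular values of Ua^T Ub are the cosines of the principal angles.
    cosines = torch.linalg.svdvals(Ua.T @ Ub)
    return float((cosines ** 2).mean())

if __name__ == "__main__":
    torch.manual_seed(0)
    delta_benign = torch.randn(512, 512)   # stand-in for a benign fine-tuning delta
    delta_harmful = torch.randn(512, 512)  # stand-in for a harmful fine-tuning delta
    print(f"subspace overlap: {subspace_overlap(delta_benign, delta_harmful):.3f}")
```

A high overlap under such a measure would be consistent with the paper's finding that directions amplifying safe behavior also amplify unsafe behavior.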
Related papers
- The Geometry of Harmfulness in LLMs through Subconcept Probing [3.6335172274433414]
We introduce a multidimensional framework for probing and steering harmful content in language models.
For each of 55 distinct harmfulness subconcepts, we learn a linear probe, yielding 55 interpretable directions in activation space.
We then test ablation of the entire subspace from model internals, as well as steering and ablation in the subspace's dominant direction.
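To make the probe-and-ablate recipe concrete, here is a minimal sketch for a single subconcept (function names, probe training details, and the toy data are assumptions, not the paper's code): fit a logistic-regression probe on cached activations, then project its direction out of the activations.

```python
# Sketch: linear probe for one subconcept, then ablation of its direction.
import torch

def fit_probe_direction(acts: torch.Tensor, labels: torch.Tensor, steps: int = 200) -> torch.Tensor:
    """Logistic-regression probe; returns the unit-norm learned direction."""
    w = torch.zeros(acts.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        logits = acts @ w + b
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
        loss.backward()
        opt.step()
    return (w / w.norm()).detach()

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along the probe direction."""
    return acts - (acts @ direction).unsqueeze(-1) * direction

if __name__ == "__main__":
    torch.manual_seed(0)
    acts = torch.randn(256, 64)
    labels = (acts[:, 0] > 0).long()      # toy subconcept living along dimension 0
    d = fit_probe_direction(acts, labels)
    clean = ablate_direction(acts, d)
    print("residual component along probe:", float((clean @ d).abs().mean()))
```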
arXiv Detail & Related papers (2025-07-23T07:56:05Z)
- Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs [0.0]
We show that fine-tuning on insecure code induces internal changes that oppose alignment.
We identify a shared latent dimension in the model's activation space that governs alignment behavior.
arXiv Detail & Related papers (2025-07-04T15:36:58Z)
- Probing the Robustness of Large Language Models Safety to Latent Perturbations [30.16804362984161]
Safety alignment is a key requirement for building reliable Artificial General Intelligence.
We observe that minor latent shifts can still trigger unsafe responses in aligned models.
We introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training.
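The summary only states that controlled perturbations are injected into hidden representations; the sketch below is an assumed, generic realization of that idea using a forward hook (the hook mechanism, layer choice, and noise scale are not from the paper):

```python
# Sketch: bounded random perturbation of one layer's hidden states during fine-tuning.
import torch

def make_latent_perturbation_hook(epsilon: float = 0.05):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noise = torch.randn_like(hidden)
        # Normalize per position so every perturbation has norm epsilon.
        noise = epsilon * noise / noise.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        perturbed = hidden + noise
        return (perturbed,) + output[1:] if isinstance(output, tuple) else perturbed
    return hook

# Hypothetical usage on a decoder layer during the fine-tuning loop:
# handle = model.model.layers[12].register_forward_hook(make_latent_perturbation_hook())
# ...train...
# handle.remove()
```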
arXiv Detail & Related papers (2025-06-19T07:03:05Z)
- Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment? [73.80382983108997]
Representation intervention aims to locate and modify the representations that encode the underlying concepts in Large Language Models.
If the interventions are faithful, the intervened LLMs should erase the harmful concepts and be robust to both in-distribution adversarial prompts and the out-of-distribution jailbreaks.
We propose Concept Concentration (COCA), which simplifies the decision boundary between harmful and benign representations.
arXiv Detail & Related papers (2025-05-24T12:23:52Z)
- Shape it Up! Restoring LLM Safety during Finetuning [66.46166656543761]
Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks.
We propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content.
We present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families.
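The fine-grained shaping signal is described only at a high level; as a generic sketch of per-token loss shaping (the actual STAR scores and weighting scheme are not specified here, so `safety_weights` is a stand-in), one can reweight the language-modeling loss so safe tokens contribute more than unsafe ones:

```python
# Sketch: safety-weighted per-token language-modeling loss.
import torch
import torch.nn.functional as F

def shaped_lm_loss(logits: torch.Tensor, targets: torch.Tensor,
                   safety_weights: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V), targets: (B, T), safety_weights: (B, T) in [0, 1]."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1), reduction="none"
    ).reshape(targets.shape)
    return (safety_weights * per_token).sum() / safety_weights.sum().clamp_min(1e-6)

if __name__ == "__main__":
    B, T, V = 2, 8, 100
    logits = torch.randn(B, T, V, requires_grad=True)
    targets = torch.randint(0, V, (B, T))
    weights = torch.rand(B, T)            # stand-in for a fine-grained safety signal
    loss = shaped_lm_loss(logits, targets, weights)
    loss.backward()
    print(float(loss))
```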
arXiv Detail & Related papers (2025-05-22T18:05:16Z)
- The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis [20.522881564776434]
We find that safety-aligned behavior is jointly controlled by multi-dimensional directions.
By studying directions in this space, we first find that a dominant direction governs the model's refusal behavior.
We then measure how different directions promote or suppress the dominant direction.
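How the individual directions are extracted is not shown here; a minimal sketch of the second step (the promote/suppress measurement, assuming cosine similarity as the metric) could look like:

```python
# Sketch: alignment of candidate directions with a dominant "refusal" direction.
import torch

def alignment_with_dominant(dominant: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
    """Cosine similarity of each candidate direction with the dominant one.
    Positive values promote refusal; negative values suppress it."""
    dominant = dominant / dominant.norm()
    others = others / others.norm(dim=-1, keepdim=True)
    return others @ dominant

if __name__ == "__main__":
    torch.manual_seed(0)
    dominant = torch.randn(64)      # stand-in for the dominant refusal direction
    candidates = torch.randn(5, 64) # stand-ins for other safety-related directions
    print(alignment_with_dominant(dominant, candidates))
```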
arXiv Detail & Related papers (2025-02-13T06:39:22Z)
- Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction.
We identify four types of attribute-critical components in safety-aligned large language models (LLMs).
Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
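As an illustrative sketch of the freezing step (the component names below are hypothetical; the paper's method of identifying safety-critical components is not reproduced here), one can disable gradients on a named subset of parameters before fine-tuning:

```python
# Sketch: freeze a named subset of parameters standing in for "safety-critical components".
def freeze_safety_critical(model, critical_name_fragments):
    frozen, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        if any(frag in name for frag in critical_name_fragments):
            param.requires_grad_(False)   # keep aligned values fixed during fine-tuning
            frozen += param.numel()
    print(f"frozen {100.0 * frozen / total:.1f}% of parameters")

# Hypothetical usage; the fragment list is illustrative only:
# freeze_safety_critical(model, ["layers.0.", "layers.1.", "lm_head"])
```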
arXiv Detail & Related papers (2024-10-07T19:53:35Z)
- Evaluating Defences against Unsafe Feedback in RLHF [26.872318173182414]
This paper studies learning from unsafe feedback with reinforcement learning.
We find that safety-aligned LLMs easily explore unsafe action spaces by generating harmful text.
To protect against this vulnerability, we adapt a number of both "implicit" and "explicit" harmful fine-tuning defences.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z)
- Safety Alignment Should Be Made More Than Just a Few Tokens Deep [48.823599143711235]
The safety alignment of current Large Language Models (LLMs) is vulnerable.
Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models.
We show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits.
arXiv Detail & Related papers (2024-06-10T00:35:23Z)
- Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals [52.123343364599094]
Adversarial attacks place carefully crafted perturbations on normal examples to fool deep neural networks (DNNs).
We first empirically show that the features of clean signals and of adversarial perturbations are redundant and each span a low-dimensional linear subspace, with minimal overlap between the two.
This makes it possible for DNNs to learn a subspace where only features of clean signals exist while those of perturbations are discarded.
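A minimal sketch of this idea (PCA-based and purely illustrative; the paper's subspace is learned, not necessarily obtained via SVD of clean features): estimate a low-dimensional basis from clean features and project incoming features onto it, discarding the residual where perturbation energy would live.

```python
# Sketch: project features onto a subspace fit to clean signals.
import torch

def fit_clean_subspace(clean_feats: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Orthonormal (d, k) basis spanning the top-k directions of clean features."""
    centered = clean_feats - clean_feats.mean(dim=0, keepdim=True)
    U, S, Vh = torch.linalg.svd(centered, full_matrices=False)
    return Vh[:k].T

def project_to_subspace(feats: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Keep only the components of each feature vector inside the clean subspace."""
    return (feats @ basis) @ basis.T

if __name__ == "__main__":
    torch.manual_seed(0)
    clean = torch.randn(1024, 128)
    basis = fit_clean_subspace(clean, k=32)
    noisy = clean[:4] + 0.1 * torch.randn(4, 128)   # toy perturbed features
    print(project_to_subspace(noisy, basis).shape)
```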
arXiv Detail & Related papers (2024-03-24T14:35:44Z)
- Provable Safe Reinforcement Learning with Binary Feedback [62.257383728544006]
We consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs.
We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting.
arXiv Detail & Related papers (2022-10-26T05:37:51Z)
- Fail-Safe Adversarial Generative Imitation Learning [9.594432031144716]
We propose a safety layer that enables a closed-form probability density/gradient of the safe generative continuous policy, end-to-end generative adversarial training, and worst-case safety guarantees.
The safety layer maps all actions into a set of safe actions, and uses the change-of-variables formula plus additivity of measures for the density.
In an experiment on real-world driver interaction data, we empirically demonstrate tractability, safety and imitation performance of our approach.
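The density construction is only named here; as a hedged illustration (the exact form used in the paper may differ), the standard change-of-variables identity for a piecewise-invertible safety map, with pre-images combining additively, would read:

```latex
% Illustrative only: the safety layer g maps raw actions u to safe actions a = g(u);
% additivity over the pre-images of a combines with change of variables:
p_A(a) \;=\; \sum_{u \,\in\, g^{-1}(a)} p_U(u)\,\bigl|\det \nabla g(u)\bigr|^{-1}
```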
arXiv Detail & Related papers (2022-03-03T13:03:06Z)