The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis
- URL: http://arxiv.org/abs/2502.09674v2
- Date: Tue, 18 Feb 2025 03:24:45 GMT
- Title: The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Safety Analysis
- Authors: Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, Xiaohua Jia
- Abstract summary: We find that safety-aligned behavior is jointly controlled by multi-dimensional directions. By studying directions in the space, we first find that a dominant direction governs the model's refusal behavior. We then measure how different directions promote or suppress the dominant direction.
- Score: 20.522881564776434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights into safety-alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
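The pipeline below is a minimal, self-contained sketch (not the code released at the repository above) of how such a multi-dimensional safety residual space can be recovered: collect per-prompt activation shifts between a base model and its safety-fine-tuned counterpart, then take an SVD so the top singular vector plays the role of the dominant refusal direction and the remaining vectors act as secondary directions. Random arrays stand in for real Llama 3 8B hidden states, and all sizes are illustrative assumptions.

```python
# Hedged sketch: recovering a multi-dimensional "safety residual space"
# from activation shifts between a base and a safety-fine-tuned model.
# Real activations would come from a model such as Llama 3 8B; random
# arrays serve as placeholders here.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, d_model = 512, 4096                       # hypothetical sizes

# Hidden states for the same prompts before and after safety fine-tuning.
h_base = rng.normal(size=(n_prompts, d_model))       # placeholder
h_aligned = rng.normal(size=(n_prompts, d_model))    # placeholder

# Representation shifts induced by safety fine-tuning.
shifts = h_aligned - h_base
shifts -= shifts.mean(axis=0, keepdims=True)

# Orthogonal directions of the shift space via SVD: the top singular vector
# acts as the dominant (refusal) direction, the rest as the smaller,
# interpretable secondary directions described in the abstract.
U, S, Vt = np.linalg.svd(shifts, full_matrices=False)
dominant = Vt[0]
secondary = Vt[1:8]

# Fraction of the shift variance each direction explains.
explained = S**2 / np.sum(S**2)
print("variance explained by dominant direction:", explained[0])

# Per-prompt projections; correlating a secondary direction's projection
# with the dominant one indicates whether that feature promotes or
# suppresses the refusal representation.
proj_dom = shifts @ dominant
proj_sec = shifts @ secondary[0]
print("correlation:", np.corrcoef(proj_dom, proj_sec)[0, 1])
```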
Related papers
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [73.09848497762667]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics.
AdaSteer steers input representations along both the Rejection Direction (RD) and the Harmfulness Direction (HD).
Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
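A hedged sketch of the adaptive-steering idea above (not AdaSteer's released implementation): shift a hidden state along placeholder rejection and harmfulness directions, with a coefficient that grows with how harmful the representation already looks. The directions, the tanh schedule, and the strength constant are all illustrative assumptions.

```python
# Illustrative only: adaptive activation steering along two placeholder
# directions, with input-dependent strength.
import numpy as np

rng = np.random.default_rng(1)
d_model = 4096

rd = rng.normal(size=d_model); rd /= np.linalg.norm(rd)   # placeholder RD
hd = rng.normal(size=d_model); hd /= np.linalg.norm(hd)   # placeholder HD

def adaptive_steer(h, rd, hd, base_strength=4.0):
    """Shift h along RD/HD; harmless-looking inputs get a weaker push."""
    harmfulness = float(h @ hd)                # proxy score from HD projection
    alpha = base_strength * np.tanh(max(harmfulness, 0.0))
    return h + alpha * rd + alpha * hd

h = rng.normal(size=d_model)                   # stand-in hidden state
h_steered = adaptive_steer(h, rd, hd)
print(np.linalg.norm(h_steered - h))
```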
arXiv Detail & Related papers (2025-04-13T07:39:17Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
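A rough sketch of the two-component objective (not the paper's exact losses): cross-entropy toward refusal tokens for robust refusal training, plus a sign-flipped cross-entropy term that lowers the likelihood of harmful continuations as a simple stand-in for targeted unlearning. Dummy tensors replace real model logits and token ids.

```python
# Toy dual-objective loss; real training would use model outputs on
# prompts that include partial unsafe generations.
import torch
import torch.nn.functional as F

vocab, seq_len = 32000, 16
logits_refusal = torch.randn(seq_len, vocab, requires_grad=True)  # placeholder
refusal_targets = torch.randint(0, vocab, (seq_len,))             # refusal tokens

logits_harmful = torch.randn(seq_len, vocab, requires_grad=True)  # placeholder
harmful_targets = torch.randint(0, vocab, (seq_len,))             # harmful tokens

# (1) Robust refusal: standard cross-entropy toward refusal tokens.
loss_refusal = F.cross_entropy(logits_refusal, refusal_targets)

# (2) Targeted unlearning: raise the loss on harmful continuations
# (a simple gradient-ascent form; the paper's formulation may differ).
loss_unlearn = -F.cross_entropy(logits_harmful, harmful_targets)

loss = loss_refusal + 0.1 * loss_unlearn
loss.backward()
print(float(loss))
```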
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence [57.57786477441956]
Prior work suggests that a single refusal direction in the model's activation space determines whether an LLM refuses a request.
We propose a novel gradient-based approach to representation engineering and use it to identify refusal directions.
We show that refusal mechanisms in LLMs are governed by complex spatial structures and identify functionally independent directions.
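A loose illustration of gradient-based refusal-direction discovery (not the paper's procedure): a toy differentiable refusal score stands in for the LLM, and a candidate direction is optimized so that subtracting it from harmful activations suppresses refusal. Every quantity here is a placeholder.

```python
# Toy gradient-based search for a refusal direction.
import torch

d_model = 256
torch.manual_seed(0)
w_true = torch.randn(d_model)                  # hidden "ground-truth" feature

def refusal_score(h):
    return torch.sigmoid(h @ w_true)           # toy stand-in for the model

h_harmful = torch.randn(32, d_model)           # placeholder activations
v = torch.zeros(d_model, requires_grad=True)   # candidate refusal direction
opt = torch.optim.Adam([v], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    # Subtracting v from harmful activations should suppress refusal.
    loss = refusal_score(h_harmful - v).mean()
    loss.backward()
    opt.step()

cos = torch.nn.functional.cosine_similarity(v, w_true, dim=0)
print("cosine to the underlying feature:", float(cos))
```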
arXiv Detail & Related papers (2025-02-24T18:52:59Z) - Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction.
We identify four types of attribute-critical components in safety-aligned large language models (LLMs).
Our findings show that freezing certain safety-critical components (7.5%) during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
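A minimal sketch of the freezing step in this spirit: mark a hypothetical set of safety-critical parameters as non-trainable before downstream fine-tuning. Which tensors count as safety-critical is an assumption here; the paper identifies them with its own analysis.

```python
# Freeze a placeholder set of "safety-critical" parameters before fine-tuning.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))

# Hypothetical safety-critical parameter names (placeholder choice).
safety_critical = {"0.weight"}

for name, p in model.named_parameters():
    p.requires_grad = name not in safety_critical

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print("updated during fine-tuning:", trainable)
```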
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend against threats from malicious instructions. Recent research reveals that safety-aligned LLMs are prone to rejecting benign queries due to this exaggerated safety issue. We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate these exaggerated safety concerns.
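A hedged sketch of safety-conscious steering (not the SCANS implementation): queries whose representations already look benign are nudged away from a refusal direction, while others are left untouched. The direction, threshold, and strength are placeholders.

```python
# Toy conditional steering away from a refusal direction for benign inputs.
import numpy as np

rng = np.random.default_rng(2)
d_model = 4096
refusal_dir = rng.normal(size=d_model)
refusal_dir /= np.linalg.norm(refusal_dir)

def conscious_steer(h, refusal_dir, threshold=0.0, strength=3.0):
    """Subtract the refusal direction only when the query looks benign."""
    if float(h @ refusal_dir) < threshold:     # crude benign check (placeholder)
        return h - strength * refusal_dir
    return h

h_benign = rng.normal(size=d_model)
print(np.linalg.norm(conscious_steer(h_benign, refusal_dir) - h_benign))
```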
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
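A toy-scale sketch of the bi-level idea (not BEEAR itself): an inner loop searches for a universal embedding perturbation that elicits the unwanted behavior, and an outer loop updates the model to behave safely under that perturbation. A linear probe stands in for the language model.

```python
# Toy bi-level loop: inner step finds a universal perturbation,
# outer step hardens the model against it.
import torch
import torch.nn.functional as F

d = 128
torch.manual_seed(0)
model = torch.nn.Linear(d, 2)                  # 0 = safe, 1 = unsafe (toy labels)
emb = torch.randn(64, d)                       # placeholder embeddings
safe_labels = torch.zeros(64, dtype=torch.long)

opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)

for _ in range(50):
    # Inner loop: find a shared perturbation that drives outputs unsafe.
    delta = torch.zeros(d, requires_grad=True)
    opt_delta = torch.optim.Adam([delta], lr=1e-1)
    for _ in range(5):
        opt_delta.zero_grad()
        unsafe_loss = F.cross_entropy(model(emb + delta),
                                      torch.ones(64, dtype=torch.long))
        unsafe_loss.backward()
        opt_delta.step()

    # Outer loop: reinforce safe behavior against that perturbation.
    opt_model.zero_grad()
    loss = F.cross_entropy(model(emb + delta.detach()), safe_labels)
    loss.backward()
    opt_model.step()

print("outer loss:", float(loss))
```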
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
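An illustrative sketch of the optimization described above (not the released DRO code): a safety prompt is treated as trainable continuous embeddings and tuned so that harmful queries move along a refusal direction and benign ones move against it. A mean-pooling toy encoder replaces the LLM.

```python
# Toy safety-prompt optimization against a fixed refusal direction.
import torch

d = 64
torch.manual_seed(0)
refusal_dir = torch.randn(d); refusal_dir /= refusal_dir.norm()

queries = torch.randn(32, 8, d)                       # placeholder token embeddings
harmful = torch.randint(0, 2, (32,)).float() * 2 - 1  # +1 harmful, -1 benign

prompt = torch.zeros(4, d, requires_grad=True)        # trainable safety prompt
opt = torch.optim.Adam([prompt], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    # Toy encoder: mean over safety-prompt and query token embeddings.
    reps = torch.cat([prompt.expand(32, -1, -1), queries], dim=1).mean(dim=1)
    proj = reps @ refusal_dir
    # Push harmful queries up the refusal direction, benign ones down.
    loss = -(harmful * proj).mean()
    loss.backward()
    opt.step()

print("final loss:", float(loss))
```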
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Trojan Activation Attack: Red-Teaming Large Language Models using Activation Steering for Safety-Alignment [31.24530091590395]
We study an attack scenario called Trojan Activation Attack (TA2), which injects trojan steering vectors into the activation layers of Large Language Models.
Our experiment results show that TA2 is highly effective and adds little or no overhead to attack efficiency.
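A minimal sketch of activation-layer injection with a forward hook, in the spirit of a trojan activation attack; the target module and the steering vector are placeholders rather than anything from the paper.

```python
# Inject a placeholder steering vector into an intermediate layer at inference.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
trojan_vec = 3.0 * torch.randn(32)            # placeholder steering vector

def inject(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + trojan_vec

handle = model[0].register_forward_hook(inject)
x = torch.randn(4, 32)
print("with trojan:   ", float(model(x).norm()))
handle.remove()                               # detach the hook
print("without trojan:", float(model(x).norm()))
```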
arXiv Detail & Related papers (2023-11-15T23:07:40Z) - Where and What? Examining Interpretable Disentangled Representations [96.32813624341833]
Capturing interpretable variations has long been one of the goals in disentanglement learning.
Unlike the independence assumption, interpretability has rarely been exploited to encourage disentanglement in the unsupervised setting.
In this paper, we examine the interpretability of disentangled representations by investigating two questions: where to be interpreted and what to be interpreted.
arXiv Detail & Related papers (2021-04-07T11:22:02Z) - LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions [0.02294014185517203]
We propose a contrastive-learning-based approach for discovering semantic directions in the latent space of pretrained GANs.
Our approach finds semantically meaningful dimensions compatible with state-of-the-art methods.
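A heavily simplified sketch of the contrastive objective (placeholder generator and sizes, not the LatentCLR implementation): learn K latent directions whose induced feature changes are mutually distinguishable under an InfoNCE-style loss.

```python
# Toy contrastive discovery of latent directions; a frozen linear map
# stands in for an intermediate GAN feature extractor.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
latent_dim, feat_dim, K, batch = 16, 64, 4, 8

G_feat = torch.nn.Linear(latent_dim, feat_dim)        # stand-in generator features
for p in G_feat.parameters():
    p.requires_grad_(False)

directions = torch.nn.Parameter(torch.randn(K, latent_dim))
opt = torch.optim.Adam([directions], lr=1e-2)

for _ in range(100):
    opt.zero_grad()
    z = torch.randn(batch, latent_dim)
    dz = F.normalize(directions, dim=1)               # unit-norm directions
    # Feature change each direction induces on each latent sample.
    delta = G_feat(z[:, None, :] + dz[None, :, :]) - G_feat(z)[:, None, :]
    feats = F.normalize(delta.reshape(batch * K, feat_dim), dim=1)
    labels = torch.arange(K).repeat(batch)            # same direction = positive pair
    sim = feats @ feats.T / 0.5                       # temperature 0.5
    eye = torch.eye(batch * K, dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))
    pos = (labels[:, None] == labels[None, :]) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -log_prob[pos].mean()                      # InfoNCE-style objective
    loss.backward()
    opt.step()

print("final contrastive loss:", float(loss))
```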
arXiv Detail & Related papers (2021-04-02T00:11:22Z) - Unsupervised Discovery of Interpretable Directions in the GAN Latent Space [39.54530450932134]
Latent spaces of GAN models often have semantically meaningful directions.
We introduce an unsupervised method to identify interpretable directions in the latent space of a pretrained GAN model.
We show how to exploit this finding to achieve competitive performance for weakly-supervised saliency detection.
arXiv Detail & Related papers (2020-02-10T13:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.