Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
- URL: http://arxiv.org/abs/2601.19375v1
- Date: Tue, 27 Jan 2026 08:56:25 GMT
- Title: Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
- Authors: Quy-Anh Dang, Chris Ngo
- Abstract summary: Large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. We propose Selective Steering, which addresses the limitations of existing activation steering methods through two key innovations. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer-specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm-preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite-signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification. Code: https://github.com/knoveleng/steering
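The norm-preservation property central to the abstract can be illustrated with a small sketch: rotating an activation vector within a 2D subspace is an isometry, so the activation's norm (and hence the scale of its distribution) is left exactly intact. The sketch below is illustrative only, assuming NumPy and a steering plane spanned by two orthonormal direction vectors; function and variable names are hypothetical, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def rotate_in_subspace(h, u, v, theta):
    """Rotate activation h by angle theta within the plane spanned by
    orthonormal directions u and v, leaving the orthogonal complement
    untouched. Rotation is an isometry, so ||h|| is preserved exactly."""
    # Coordinates of h in the (u, v) plane
    a = h @ u
    b = h @ v
    # Component of h orthogonal to the steering plane (left unchanged)
    h_perp = h - a * u - b * v
    # Standard 2D rotation of the in-plane coordinates
    a_new = a * np.cos(theta) - b * np.sin(theta)
    b_new = a * np.sin(theta) + b * np.cos(theta)
    return h_perp + a_new * u + b_new * v

# Example: random activation, orthonormal plane via QR decomposition
rng = np.random.default_rng(0)
h = rng.normal(size=64)
q, _ = np.linalg.qr(rng.normal(size=(64, 2)))
u, v = q[:, 0], q[:, 1]
h_rot = rotate_in_subspace(h, u, v, np.pi / 4)
assert np.isclose(np.linalg.norm(h_rot), np.linalg.norm(h))
```

Varying `theta` continuously between 0 and pi interpolates between the original activation and its reflection through the plane, which is how rotation-based steering exposes a continuous control knob rather than the binary on/off of directional ablation.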
Related papers
- Primary-Fine Decoupling for Action Generation in Robotic Imitation [91.2899765310853]
Multi-modal distribution in robotic manipulation action sequences poses critical challenges for imitation learning. We propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. PF-DAG outperforms state-of-the-art baselines across 56 tasks from the Adroit, DexArt, and MetaWorld benchmarks.
arXiv Detail & Related papers (2026-02-25T08:36:45Z)
- Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions [37.08071497197165]
Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. We build on the principles of distributed alignment search to propose a new steering method: Concept DAS. We show that Concept DAS does not always outperform preference-optimization methods but may benefit more from increased model scale.
arXiv Detail & Related papers (2026-02-05T02:51:00Z)
- Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z)
- Angular Steering: Behavior Control via Rotation in Activation Space [1.3400719989424488]
Angular Steering is a novel and flexible method for behavior modulation. It operates by rotating activations within a fixed two-dimensional subspace. It provides continuous, fine-grained control over behaviors such as refusal and compliance.
arXiv Detail & Related papers (2025-10-30T08:23:35Z)
- PIXEL: Adaptive Steering Via Position-wise Injection with eXact Estimated Levels under Subspace Calibration [17.225716209866086]
We propose a position-wise activation steering framework for large language models (LLMs) on the web. PIXEL learns a property-aligned subspace from dual views and selects intervention strength via a constrained geometric objective. PIXEL consistently improves attribute alignment while preserving the model's general capabilities.
arXiv Detail & Related papers (2025-10-11T13:13:34Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- Normalized Attention Guidance: Universal Negative Guidance for Diffusion Models [57.20761595019967]
We present Normalized Attention Guidance (NAG), an efficient, training-free mechanism that applies extrapolation in attention space with L1-based normalization and refinement. NAG restores effective negative guidance where CFG collapses while maintaining fidelity. NAG generalizes across architectures (UNet, DiT), sampling regimes (few-step, multi-step), and modalities (image, video).
arXiv Detail & Related papers (2025-05-27T13:30:46Z)
- Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs [8.085475675888045]
Activation steering provides an alternative for inference-time control. We introduce a novel approach using a lightweight, trainable controller network integrated during inference.
arXiv Detail & Related papers (2025-05-22T01:48:38Z)
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [99.3105257001476]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics. AdaSteer steers input representations along both the Rejection Direction (RD) and Harmfulness Direction (HD). Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
arXiv Detail & Related papers (2025-04-13T07:39:17Z)
- Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while using zero exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z)
- Let Offline RL Flow: Training Conservative Agents in the Latent Space of Normalizing Flows [58.762959061522736]
Offline reinforcement learning aims to train a policy on a pre-recorded, fixed dataset without any additional environment interactions.
We build upon recent works on learning policies in latent action spaces and use a special form of Normalizing Flows for constructing a generative model.
We evaluate our method on various locomotion and navigation tasks, demonstrating that our approach outperforms recently proposed algorithms.
arXiv Detail & Related papers (2022-11-20T21:57:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.