Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
- URL: http://arxiv.org/abs/2505.20309v1
- Date: Thu, 22 May 2025 01:48:38 GMT
- Title: Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
- Authors: Amr Hegazy, Mostafa Elhoushi, Amr Alanwar
- Abstract summary: Activation steering provides an alternative for inference-time control. We introduce a novel approach using a lightweight, trainable controller network integrated during inference.
- Score: 3.2361985831403404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. These predictions then dynamically modulate the intensity of a steering patch, derived from a pre-computed "refusal direction" vector, applied across the LLM's layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments on safety benchmarks such as ToxicChat and In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B, and Mistral-7B show that our approach outperforms existing methods, presenting an efficient and adaptive means of fine-grained control over LLM behavior at inference time.
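The mechanism described in the abstract reduces to two pieces: a small network that maps an observed activation to a global scale and per-layer weights, and hooks that add the weighted refusal-direction patch to each layer's output. Below is a minimal PyTorch sketch of that scheme; the controller architecture, probe layer, and hook placement are assumptions for illustration rather than the authors' exact implementation, and `refusal_dir` stands in for the pre-computed refusal-direction vector.

```python
import torch
import torch.nn as nn

class SteeringController(nn.Module):
    """Lightweight controller: maps one observed activation vector to a
    global scaling factor in (0, 1) plus layer-specific weights.
    The architecture here is illustrative, not the paper's exact design."""
    def __init__(self, d_model: int, n_layers: int, d_hidden: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU())
        self.scale_head = nn.Linear(d_hidden, 1)          # global scaling factor
        self.weight_head = nn.Linear(d_hidden, n_layers)  # layer-specific weights

    def forward(self, h):  # h: (d_model,) activation observed at a probe layer
        z = self.backbone(h)
        alpha = torch.sigmoid(self.scale_head(z))       # ~0 for benign, ~1 for harmful
        w = torch.softmax(self.weight_head(z), dim=-1)  # spreads intensity over layers
        return alpha, w

def apply_weighted_steering(layers, h_obs, refusal_dir, controller):
    """Hook every decoder layer so its output is shifted by
    alpha * w[l] * refusal_dir (the weighted steering patch)."""
    alpha, w = controller(h_obs)
    handles = []
    for l, layer in enumerate(layers):
        def hook(module, inputs, output, l=l):
            hidden = output[0] if isinstance(output, tuple) else output
            patched = hidden + alpha * w[l] * refusal_dir
            return (patched,) + output[1:] if isinstance(output, tuple) else patched
        handles.append(layer.register_forward_hook(hook))
    return handles  # call handle.remove() on each after generation
```

The sigmoid gate is what would let such a controller stay near zero on benign prompts, while the softmax distributes a fixed budget of steering intensity across layers.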
Related papers
- LLMs-guided adaptive compensator: Bringing Adaptivity to Automatic Control Systems with Large Language Models [22.989496527440636]
Large Language Models (LLMs) are increasingly applied in robotics. We propose an LLM-guided adaptive compensator framework that avoids designing controllers from scratch. This study opens a new direction for applying LLMs in the field of automatic control.
arXiv Detail & Related papers (2025-07-28T04:12:43Z)
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
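GrAInS couples the steering strength at each token to an attribution score and then renormalizes. A minimal sketch of that shape of update, assuming the token-level attribution scores are already computed (the gradient-based attribution itself is elided) and `direction` is a steering vector for the layer:

```python
import torch

def attribution_guided_steering(hidden, direction, attributions):
    """Per-token intervention strength follows attribution scores; each
    token vector is then rescaled to its original norm to preserve
    representational scale. hidden: (seq, d); direction: (d,);
    attributions: (seq,)."""
    weights = attributions / (attributions.abs().max() + 1e-8)
    steered = hidden + weights.unsqueeze(-1) * direction
    old_norm = hidden.norm(dim=-1, keepdim=True)
    new_norm = steered.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return steered * (old_norm / new_norm)
```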
- AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer. For utility preservation, it learns to construct a nearly zero steering vector for benign data via null-space constraints. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z)
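The null-space idea can be illustrated with a projector that removes the dominant directions of benign activations before a steering map is applied, so benign inputs receive a near-zero intervention. This is a rough stand-in for AlphaSteer's principled constraint, not its actual construction:

```python
import torch

def benign_null_space_projector(benign_acts, rank):
    """P zeroes out the dominant subspace of benign activations: steering
    computed as delta = W @ (P @ h) is then ~0 for benign inputs, since
    their activations lie mostly in the removed subspace.
    benign_acts: (n, d) activations collected on benign prompts."""
    _, _, Vh = torch.linalg.svd(benign_acts, full_matrices=False)
    V = Vh[:rank].T                                   # (d, rank) top benign directions
    return torch.eye(benign_acts.shape[1]) - V @ V.T  # projects onto their complement
```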
- Patterns and Mechanisms of Contrastive Activation Engineering [0.374490703387131]
Contrastive activation engineering (CAE) has the potential to introduce a new paradigm of flexible, task-specific behavior tuning. We analyze the performance of CAE in in-distribution and out-of-distribution settings, evaluate its drawbacks, and begin to develop comprehensive guidelines for its effective deployment.
arXiv Detail & Related papers (2025-05-06T05:15:12Z)
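The core CAE primitive is a contrastive vector: the difference of mean activations between two prompt sets, added to the residual stream at inference. A minimal sketch, where the layer choice and strength are assumptions:

```python
import torch

def contrastive_steering_vector(pos_acts, neg_acts):
    """The basic contrastive recipe: the steering vector is the difference
    between mean activations on positive and negative prompt sets.
    pos_acts, neg_acts: (n, d) activations from the two contrastive sets."""
    return pos_acts.mean(dim=0) - neg_acts.mean(dim=0)

def apply_cae(hidden, vector, strength=1.0):
    # added to the residual stream at the chosen layer during generation
    return hidden + strength * vector
```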
- AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender [73.09848497762667]
We propose AdaSteer, an adaptive activation steering method that adjusts model behavior based on input characteristics. AdaSteer steers input representations along both the Rejection Direction (RD) and the Harmfulness Direction (HD). Our results highlight the potential of interpretable model internals for real-time, flexible safety enforcement in LLMs.
arXiv Detail & Related papers (2025-04-13T07:39:17Z)
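A sketch of the input-adaptive, dual-direction form the summary describes, with illustrative affine coefficients; the actual AdaSteer fits its coefficients from data rather than using the hand-set constants below:

```python
import torch

def adasteer_like(h, rd, hd, k_r=1.0, b_r=0.0, k_h=1.0, b_h=0.0):
    """Dual-direction steering whose strengths adapt to the input: here the
    coefficients are affine in the input's projection onto each direction,
    so more harmful-looking inputs are steered more strongly.
    h: (d,) input representation; rd, hd: (d,) unit directions."""
    alpha = k_r * torch.dot(h, rd) + b_r   # rejection-direction strength
    beta = k_h * torch.dot(h, hd) + b_h    # harmfulness-direction strength
    return h + alpha * rd + beta * hd
```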
- Investigating Generalization of One-shot LLM Steering Vectors [21.2431937128876]
We propose optimizing steering vectors through gradient descent on a single training example. We find that the resulting vectors effectively mediate safety-relevant behaviors in multiple models.
arXiv Detail & Related papers (2025-02-26T06:13:01Z)
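The one-shot recipe is plain gradient descent on a steering vector for a single example. A sketch, where `steered_loss` is an assumed callable that runs the steered forward pass and returns the loss on the target completion:

```python
import torch

def fit_one_shot_vector(steered_loss, d_model, steps=200, lr=1e-2):
    """Optimize a single steering vector on one training example.
    `steered_loss(v)` is assumed to add v to a chosen layer's activations
    and return e.g. -log p(target completion | prompt)."""
    v = torch.zeros(d_model, requires_grad=True)
    opt = torch.optim.Adam([v], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = steered_loss(v)
        loss.backward()
        opt.step()
    return v.detach()
```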
- Controlling Large Language Models Through Concept Activation Vectors [30.348768212571255]
We propose Generation with Concept Activation Vector (GCAV), a lightweight model control framework. GCAV ensures accurate control without requiring resource-intensive fine-tuning. Our framework achieves state-of-the-art performance with granular control, allowing fine-grained adjustment of both the steering layers and the steering magnitudes for individual samples.
arXiv Detail & Related papers (2025-01-10T07:41:48Z)
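One way to realize per-sample steering magnitudes with a concept activation vector is to solve for the smallest shift along a linear concept probe that moves its confidence to a target value; the closed form below is a hypothetical reduction of that idea, not GCAV's published procedure:

```python
import torch

def concept_probe_shift(h, w, b, target=0.2):
    """Per-sample magnitude: the smallest shift along a linear concept
    probe's weight vector w that moves the probe's confidence
    sigmoid(w.h + b) exactly to `target` (illustrative closed form)."""
    logit_target = torch.logit(torch.tensor(target))
    eps = (logit_target - (torch.dot(w, h) + b)) / torch.dot(w, w)
    return h + eps * w
```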
- Zero-Shot Strategies for Length-Controllable Summarization [56.15356055672189]
Large language models (LLMs) struggle with precise length control, particularly in zero-shot settings. We conduct a comprehensive study evaluating LLMs' length-control capabilities across multiple measures and propose practical methods to improve controllability. Our experiments with LLaMA 3 reveal stark differences in length adherence across measures and highlight inherent biases of the model.
arXiv Detail & Related papers (2024-12-31T02:53:27Z)
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
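Constraint-based fine-tuning of this kind is commonly implemented as a penalized objective with a dual-ascended multiplier; the sketch below shows that generic schema (assumed names throughout), not the paper's exact formulation:

```python
import torch

def constrained_ft_step(lm_loss, attr_score, lam, tau=0.1, eta=0.01):
    """Penalize the LM loss with a dual-weighted hinge on an attribute
    constraint attr_score <= tau (e.g. a toxicity score), then dual-ascend
    the multiplier so persistent violations are penalized more heavily."""
    violation = torch.clamp(attr_score - tau, min=0.0)
    total_loss = lm_loss + lam * violation
    new_lam = max(lam + eta * float(violation), 0.0)  # dual update
    return total_loss, new_lam
```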
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z)
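The bi-directional core can be sketched as an objective in which +v must raise the preferred response's probability and -v the dispreferred one's; `steered_logp` is an assumed helper, and the published method wraps this contrast in a fuller preference-optimization objective:

```python
def bipo_like_loss(steered_logp, v):
    """Schematic bi-directional objective: adding +v should make the
    preferred response more likely, and adding -v should make the
    dispreferred response more likely. `steered_logp(vec, response)` is
    assumed to run the steered model and return log p(response | prompt)."""
    return -(steered_logp(v, "preferred") + steered_logp(-v, "dispreferred"))
```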
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
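A sketch of cross-model guidance at a single layer: a harmfulness probe and a safety steering vector, both assumed to be extracted from a separately aligned model, gate and shift the target model's activation. The gating form and names are illustrative:

```python
import torch

def cross_model_guided_step(h, guidance_vec, probe_w, probe_b, scale=1.0):
    """If the harmfulness probe (from the aligned model) fires on the target
    model's activation h, add the aligned model's safety steering vector;
    otherwise leave the activation untouched, preserving downstream ability."""
    harmful = torch.sigmoid(torch.dot(probe_w, h) + probe_b) > 0.5
    return h + scale * guidance_vec if harmful else h
```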
- Responsive Safety in Reinforcement Learning by PID Lagrangian Methods [74.49173841304474]
Lagrangian methods exhibit oscillations and overshoot which, when applied to safe reinforcement learning, lead to constraint-violating behavior.
We propose a novel Lagrange multiplier update method that utilizes derivatives of the constraint function.
We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark.
arXiv Detail & Related papers (2020-07-08T08:43:14Z)
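The proposed update treats the constraint violation as a PID error signal: the integral term recovers ordinary Lagrangian multiplier ascent, while the proportional and derivative terms damp the oscillation and overshoot noted above. A minimal sketch with illustrative gains:

```python
def pid_lagrangian_update(cost, limit, integral, prev_cost,
                          kp=0.1, ki=0.01, kd=0.05):
    """PID control of the Lagrange multiplier: the constraint violation
    (cost - limit) is the error. Pure I-control is the classic multiplier
    ascent; P and D terms react to the current level and rate of change
    of the constraint, damping oscillation and overshoot."""
    err = cost - limit
    integral = max(integral + err, 0.0)
    deriv = max(cost - prev_cost, 0.0)  # clipped derivative of the constraint
    lam = max(kp * err + ki * integral + kd * deriv, 0.0)
    return lam, integral, cost
```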