Activation Steering with a Feedback Controller
- URL: http://arxiv.org/abs/2510.04309v1
- Date: Sun, 05 Oct 2025 18:05:28 GMT
- Title: Activation Steering with a Feedback Controller
- Authors: Dung V. Nguyen, Hieu M. Vu, Nhi Y. Pham, Lei Zhang, Tan M. Nguyen
- Abstract summary: Proportional-Integral-Derivative (PID) Steering is a principled framework that leverages the full PID controller for activation steering in large language models. PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
- Score: 4.609594868699996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controlling the behaviors of large language models (LLM) is fundamental to their safety alignment and reliable deployment. However, existing steering methods are primarily driven by empirical insights and lack theoretical performance guarantees. In this work, we develop a control-theoretic foundation for activation steering by showing that popular steering methods correspond to the proportional (P) controllers, with the steering vector serving as the feedback signal. Building on this finding, we propose Proportional-Integral-Derivative (PID) Steering, a principled framework that leverages the full PID controller for activation steering in LLMs. The proportional (P) term aligns activations with target semantic directions, the integral (I) term accumulates errors to enforce persistent corrections across layers, and the derivative (D) term mitigates overshoot by counteracting rapid activation changes. This closed-loop design yields interpretable error dynamics and connects activation steering to classical stability guarantees in control theory. Moreover, PID Steering is lightweight, modular, and readily integrates with state-of-the-art steering methods. Extensive experiments across multiple LLM families and benchmarks demonstrate that PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
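The abstract describes each PID term's role layer by layer: the P term corrects toward the target direction, the I term accumulates error across layers, and the D term damps rapid changes. A minimal sketch of that closed loop, assuming a per-layer hook receiving the hidden state; the class name, gains, and error definition are illustrative, not the paper's implementation.

```python
import numpy as np

class PIDSteerer:
    """Toy PID controller applied to a hidden state across layers."""

    def __init__(self, direction, kp=1.0, ki=0.1, kd=0.05):
        self.v = direction / np.linalg.norm(direction)  # target semantic direction
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # I term: error accumulated across layers
        self.prev_error = None   # for the D term (change in error)

    def step(self, h):
        # Error: shortfall of the activation's alignment with the target direction.
        error = 1.0 - float(h @ self.v) / (np.linalg.norm(h) + 1e-8)
        self.integral += error
        deriv = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        # Closed-loop correction applied along the steering direction.
        gain = self.kp * error + self.ki * self.integral + self.kd * deriv
        return h + gain * self.v

steerer = PIDSteerer(direction=np.ones(8))
h0 = np.random.default_rng(0).normal(size=8)
h = h0.copy()
for _ in range(4):          # simulate successive transformer layers
    h = steerer.step(h)
```

Each `step` call plays the role of one layer's intervention; the accumulated integral is what distinguishes this from the plain P-controller view of standard steering vectors.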
Related papers
- ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment [49.68063561145927]
We propose a unified ordinary differential equation (ODE)-based theoretical framework for activation steering. We introduce ODESteer, an ODE-based steering method guided by barrier functions. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements.
arXiv Detail & Related papers (2026-02-19T17:13:44Z) - AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z) - Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z) - Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables large language models to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood. We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z) - RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language model (LLM) reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z) - ATLAS: Adaptive Test-Time Latent Steering with External Verifiers for Enhancing LLMs Reasoning [13.073472989807675]
We propose Adaptive Test-time Latent Steering (ATLAS). ATLAS dynamically controls steering decisions at inference time using an external, lightweight latent verifier. Experiments on multiple mathematical reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines.
arXiv Detail & Related papers (2026-01-06T15:27:24Z) - EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering [55.56674028743782]
Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time. We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
arXiv Detail & Related papers (2025-09-29T17:59:07Z) - KV Cache Steering for Controlling Frozen LLMs [80.50365534625438]
Cache steering is a lightweight method for implicit steering of language models. We apply cache steering to induce chain-of-thought reasoning in small language models.
arXiv Detail & Related papers (2025-07-11T17:59:36Z) - STU-PID: Steering Token Usage via PID Controller for Efficient Large Language Model Reasoning [0.0]
Large language models employing extended chain-of-thought (CoT) reasoning often suffer from the overthinking phenomenon. We propose STU-PID, a novel training-free method that employs a PID controller to dynamically modulate activation steering strength during inference. Our approach combines a chunk-level classifier for detecting redundant reasoning patterns with a PID control mechanism that adaptively adjusts steering intensity based on the predicted redundancy probability.
arXiv Detail & Related papers (2025-06-23T16:47:19Z) - Instruction Following by Boosting Attention of Large Language Models [11.739148611340964]
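The STU-PID summary above describes a PID loop that turns a predicted redundancy probability into a steering strength. A hedged sketch of that idea under made-up assumptions: the target redundancy rate, the gains, and the clamping to non-negative strength are placeholders, not the paper's actual design.

```python
class StrengthPID:
    """Toy PID loop mapping a redundancy probability to a steering strength."""

    def __init__(self, target=0.2, kp=2.0, ki=0.5, kd=0.1):
        self.target = target     # assumed desired redundancy rate
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev = 0.0

    def update(self, redundancy_prob):
        # Positive error when the current chunk is more redundant than the target.
        error = redundancy_prob - self.target
        self.integral += error
        deriv = error - self.prev
        self.prev = error
        # Steering strength is clamped so it never pushes in the wrong direction.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * deriv)

pid = StrengthPID()
# Redundancy probabilities for four hypothetical reasoning chunks.
strengths = [pid.update(p) for p in (0.1, 0.6, 0.8, 0.3)]
```

With these numbers, a low-redundancy chunk yields zero strength, while increasingly redundant chunks receive progressively stronger steering.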
Latent steering is a lightweight technique that alters internal activations to guide generation. InstABoost boosts the strength of instruction prompting by altering the model's attention during generation. InstABoost demonstrates superior control success compared to both traditional prompting and latent steering.
arXiv Detail & Related papers (2025-06-16T17:42:35Z) - AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint [49.641959856967276]
We present a theoretically grounded and empirically effective activation steering method called AlphaSteer. For utility preservation, it learns to construct a nearly zero steering vector for benign data under null-space constraints. Experiments across multiple jailbreak attacks and utility benchmarks demonstrate the effectiveness of AlphaSteer.
arXiv Detail & Related papers (2025-06-08T07:03:28Z) - Autonomous Vehicle Lateral Control Using Deep Reinforcement Learning with MPC-PID Demonstration [23.245716549852332]
The controller is one of the most important modules in the autonomous driving pipeline. In this work, a reinforcement-learning-based lateral control approach is presented that remains effective despite imperfections in the vehicle models caused by measurement errors and simplifications.
arXiv Detail & Related papers (2025-06-04T15:05:06Z) - Beyond Prompt Engineering: Robust Behavior Control in LLMs via Steering Target Atoms [71.85633762642125]
The vast number of parameters in models often results in highly intertwined internal representations. Recent research has explored the use of sparse autoencoders (SAEs) to disentangle knowledge in high-dimensional spaces for steering. We propose Steering Target Atoms (STA), a novel method that isolates and manipulates disentangled knowledge components to enhance safety.
arXiv Detail & Related papers (2025-05-23T17:59:18Z) - Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs [3.2361985831403404]
Activation steering provides an alternative for inference-time control. We introduce a novel approach using a lightweight, trainable controller network integrated during inference.
arXiv Detail & Related papers (2025-05-22T01:48:38Z) - Responsive Safety in Reinforcement Learning by PID Lagrangian Methods [74.49173841304474]
Lagrangian methods exhibit oscillations and overshoot which, when applied to safe reinforcement learning, lead to constraint-violating behavior.
We propose a novel Lagrange multiplier update method that utilizes derivatives of the constraint function.
We apply our PID Lagrangian methods in deep RL, setting a new state of the art in Safety Gym, a safe RL benchmark.
arXiv Detail & Related papers (2020-07-08T08:43:14Z)
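The entry above applies the same PID idea to the Lagrange multiplier in constrained RL: the multiplier is updated from the constraint violation rather than by plain gradient ascent. A minimal sketch of such an update rule; the gains, cost sequence, and one-sided I/D terms are illustrative assumptions, not the paper's exact formulation.

```python
def pid_lambda(costs, limit, kp=0.5, ki=0.1, kd=0.2):
    """Toy PID update of a Lagrange multiplier from observed episode costs."""
    lam, integral, prev = 0.0, 0.0, 0.0
    history = []
    for c in costs:
        error = c - limit                       # constraint violation this step
        integral = max(0.0, integral + error)   # I term kept non-negative
        deriv = max(0.0, error - prev)          # one-sided D term damps overshoot
        prev = error
        lam = max(0.0, kp * error + ki * integral + kd * deriv)
        history.append(lam)
    return history

# Hypothetical per-iteration costs against a constraint limit of 2.0:
lams = pid_lambda([2.0, 3.0, 3.0, 1.0], limit=2.0)
```

The multiplier rises while the constraint is violated and relaxes back toward zero once the cost drops below the limit, which is the responsiveness the summary contrasts with oscillation-prone plain Lagrangian updates.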
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.