Related papers: ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment

URL: http://arxiv.org/abs/2602.17560v2
Date: Mon, 23 Feb 2026 00:20:41 GMT
Title: ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment
Authors: Hongjue Zhao, Haosen Sun, Jiangtao Kong, Xiaochang Li, Qineng Wang, Liwei Jiang, Qi Zhu, Tarek Abdelzaher, Yejin Choi, Manling Li, Huajie Shao,
Abstract summary: We propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering.<n>We introduce ODESteer, a kind of ODE-based steering guided by barrier functions.<n>Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements.
Score: 49.68063561145927
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Activation steering, or representation engineering, offers a lightweight approach to align large language models (LLMs) by manipulating their internal activations at inference time. However, current methods suffer from two key limitations: (i) the lack of a unified theoretical framework for guiding the design of steering directions, and (ii) an over-reliance on one-step steering that fail to capture complex patterns of activation distributions. In this work, we propose a unified ordinary differential equations (ODEs)-based theoretical framework for activation steering in LLM alignment. We show that conventional activation addition can be interpreted as a first-order approximation to the solution of an ODE. Based on this ODE perspective, identifying a steering direction becomes equivalent to designing a barrier function from control theory. Derived from this framework, we introduce ODESteer, a kind of ODE-based steering guided by barrier functions, which shows empirical advancement in LLM alignment. ODESteer identifies steering directions by defining the barrier function as the log-density ratio between positive and negative activations, and employs it to construct an ODE for multi-step and adaptive steering. Compared to state-of-the-art activation steering methods, ODESteer achieves consistent empirical improvements on diverse LLM alignment benchmarks, a notable $5.7\%$ improvement over TruthfulQA, $2.5\%$ over UltraFeedback, and $2.4\%$ over RealToxicityPrompts. Our work establishes a principled new view of activation steering in LLM alignment by unifying its theoretical foundations via ODEs, and validating it empirically through the proposed ODESteer method.

Related papers

Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining.<n>Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood.<n>We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z)
Efficient and accurate steering of Large Language Models through attention-guided feature learning [2.2940141855172036]
We introduce an attention-guided steering framework that overcomes three core challenges associated with steering.<n>Across a steering benchmark of 512 semantic concepts, our framework substantially improved steering over previous state-of-the-art.<n>Our framework opens further avenues for developing efficient, highly-scalable fine-tuning algorithms for industry-scale LLMs.
arXiv Detail & Related papers (2026-01-30T21:35:13Z)
RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language models (LLMs) reasoning in activation space.<n>RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input.<n>The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z)
Zero-Shot Instruction Following in RL via Structured LTL Representations [54.08661695738909]
Linear temporal logic (LTL) is a compelling framework for specifying complex, structured tasks for reinforcement learning (RL) agents.<n>Recent work has shown that interpreting instructions as finite automata, which can be seen as high-level programs monitoring task progress, enables learning a single generalist policy capable of executing arbitrary instructions at test time.<n>We propose a novel approach to learning a multi-task policy for following arbitrary instructions that addresses this shortcoming.
arXiv Detail & Related papers (2025-12-02T10:44:51Z)
BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles [0.0]
We propose a novel framework for multimodal reinforcement learning (RL) for autonomous lane-keeping (LK)<n>The proposed method lets the agent learn driving rules that are aware of their surroundings and easy to understand.<n>A hybrid reward function that includes semantic alignment, LK accuracy, obstacle avoidance, and speed regulation helps learning to be more efficient and generalizable.
arXiv Detail & Related papers (2025-10-25T17:27:08Z)
Activation Steering with a Feedback Controller [4.609594868699996]
Proportional-Integral-Derivative (PID) Steering is a principled framework that leverages the full PID controller for activation steering in large language models.<n>PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
arXiv Detail & Related papers (2025-10-05T18:05:28Z)
EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering [55.56674028743782]
Large language model (LLM) steering has emerged as a promising paradigm for controlling model behavior at inference time.<n>We present EasySteer, a unified framework for high-performance, LLM steering built on vLLM.
arXiv Detail & Related papers (2025-09-29T17:59:07Z)
Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving [55.13109926181247]
We introduce ReflectDrive, a learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion.<n>Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient.<n>Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors.
arXiv Detail & Related papers (2025-09-24T13:35:15Z)
Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR)<n>We present a theoretical characterization of LLM reasoning grounded in information bottleneck (IB) principle.<n>We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning [145.32076310071434]
We propose EvolveNav, a novel embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning.<n>EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model's navigational reasoning capabilities, and simultaneously increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity.
arXiv Detail & Related papers (2025-06-02T11:28:32Z)
Effectively Steer LLM To Follow Preference via Building Confident Directions [39.40603123075168]
We propose a theoretical framework to understand and quantify the model steering methods.<n>Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations.<n>Our approach offers three key advantages over popular bidirectional model steering methods.
arXiv Detail & Related papers (2025-03-04T20:32:27Z)
LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs.<n>We propose LF-Steering, a novel activation steering approach to precisely identify latent feature representations responsible for semantic inconsistency.<n>Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z)
DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation [57.07295906718989]
Constrained decoding approaches aim to control the meaning or style of text generated by pre-trained large language (Ms also PLMs) for various tasks at inference time.<n>These methods often guide plausible continuations by greedily and explicitly selecting targets.<n>Inspired by cognitive dual-process theory, we propose a novel decoding framework DECIDER.
arXiv Detail & Related papers (2024-03-04T11:49:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.