ROAST: Rollout-based On-distribution Activation Steering Technique
- URL: http://arxiv.org/abs/2602.14143v1
- Date: Sun, 15 Feb 2026 13:30:26 GMT
- Title: ROAST: Rollout-based On-distribution Activation Steering Technique
- Authors: Xuanbo Su, Hao Luo, Yingfang Zhang, Lijun Zhang
- Abstract summary: Activation steering provides parameter-efficient control over large language models at inference time. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality.
- Score: 16.632201561391366
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Activation steering provides parameter-efficient control over large language models (LLMs) at inference time, but many methods rely on off-distribution supervision and discrete masking, leading to brittle interventions. We propose ROAST (Rollout-based On-distribution Activation Steering Technique), which estimates steering directions from the model's own on-distribution rollouts via ROC and avoids hard sparsification via Continuous Soft Scaling (CSS) and Grouped Mean Normalization. Our empirical analysis reveals that while activation magnitude correlates moderately with directional consistency, the variance in magnitude is significant and often disproportionate to semantic quality. This suggests that high-magnitude activations risk dominating the global steering direction if not properly normalized. To address this, ROAST employs grouped normalization to balance contributions across samples, ensuring a more robust estimation of the consensus steering direction. Across models (0.6B to 32B), ROAST consistently improves performance on diverse tasks (e.g., +9.7% on GSM8K for Qwen3-0.6B and +12.1% on TruthfulQA for GLM4-32B), and analyses show that CSS better preserves activation energy.
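The abstract's two normalization ideas can be illustrated with a short sketch. This is a minimal, hypothetical reading of the method, not the authors' implementation: the function names, the per-sample difference vectors, the group size, and the sigmoid form of the soft gate are all assumptions.

```python
import numpy as np

def estimate_steering_direction(pos_acts, neg_acts, group_size=4):
    """Estimate a consensus steering direction from rollout activations.

    pos_acts, neg_acts: (n_samples, d) activations from on-distribution
    rollouts judged positive / negative on the target behavior.
    Grouped Mean Normalization (assumed form): average difference vectors
    within groups, then rescale each group mean to unit norm so that
    high-magnitude samples cannot dominate the global direction.
    """
    diffs = pos_acts - neg_acts                      # per-sample difference vectors
    n_groups = len(diffs) // group_size
    grouped = diffs[: n_groups * group_size].reshape(n_groups, group_size, -1)
    group_means = grouped.mean(axis=1)
    group_means /= np.linalg.norm(group_means, axis=1, keepdims=True) + 1e-8
    direction = group_means.mean(axis=0)             # consensus across groups
    return direction / (np.linalg.norm(direction) + 1e-8)

def continuous_soft_scale(direction, scores, temperature=1.0):
    """Continuous Soft Scaling (assumed form): instead of hard-masking
    coordinates, weight each one by a sigmoid of an importance score,
    preserving activation energy in low-scoring dimensions."""
    gate = 1.0 / (1.0 + np.exp(-scores / temperature))   # soft gate in (0, 1)
    return direction * gate
```

The contrast with discrete masking is the point of the second function: a hard top-k mask zeroes coordinates outright, while the sigmoid gate only attenuates them, which is one way to read the claim that CSS "better preserves activation energy."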
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
arXiv Detail & Related papers (2026-02-28T02:50:04Z) - ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z) - Information Fidelity in Tool-Using LLM Agents: A Martingale Analysis of the Model Context Protocol [69.11739400975445]
We introduce the first theoretical framework for analyzing error accumulation in Model Context Protocol (MCP) agents. We show that cumulative distortion exhibits linear growth and high-probability deviations bounded by $O(\sqrt{T})$. Key findings include: semantic weighting reduces distortion by 80%, and periodic re-grounding approximately every 9 steps suffices for error control.
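The interplay between linear distortion growth and periodic re-grounding can be seen in a toy random-walk simulation. This is purely illustrative and not the paper's model: the drift and noise parameters, the reset-to-zero re-grounding, and the 9-step interval's effect here are all invented for the sketch.

```python
import random

def simulate_distortion(T=200, drift=0.05, noise=0.1, reground_every=None, seed=0):
    """Toy model: cumulative distortion drifts upward each step (linear
    growth in expectation) with Gaussian jitter; optional periodic
    re-grounding resets the accumulated deviation to zero."""
    rng = random.Random(seed)
    distortion, trace = 0.0, []
    for t in range(1, T + 1):
        distortion += drift + rng.gauss(0.0, noise)
        if reground_every and t % reground_every == 0:
            distortion = 0.0          # re-ground: restore fidelity
        trace.append(distortion)
    return trace

free = simulate_distortion()                     # no re-grounding: drifts away
grounded = simulate_distortion(reground_every=9) # bounded between resets
```

With re-grounding, peak distortion is capped by what can accumulate within one interval, while the un-grounded walk grows roughly linearly with T.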
arXiv Detail & Related papers (2026-02-10T21:08:53Z) - Dynamics-Aligned Shared Hypernetworks for Zero-Shot Actuator Inversion [3.335249027791264]
We propose DMA*-SH, a framework where a single hypernetwork, trained solely via dynamics prediction, generates a small set of adapter weights. This shared modulation imparts an inductive bias matched to actuator inversion, while input/output normalization and random input masking stabilize context inference. For evaluation, we introduce the Actuator Inversion Benchmark (AIB), a suite of environments designed to isolate discontinuous context-to-dynamics interactions.
arXiv Detail & Related papers (2026-02-06T09:55:05Z) - Group Distributionally Robust Optimization-Driven Reinforcement Learning for LLM Reasoning [45.86058898829962]
Group Distributionally Robust Optimization (GDRO) is an optimization-first framework that moves beyond uniform reasoning. We propose two independent GDRO games for post-training: Prompt-GDRO, which employs an EMA-debiased multiplicative-weight bandit sampler to target the intensive difficulty margin and upweight persistently hard groups without frequency bias; and Rollout-GDRO, which uses a shadow-price controller to reallocate rollouts across groups, maximizing gradient variance reduction on hard tasks under a fixed mean budget (compute-neutral). We validate our framework on the DAPO 14.1k dataset using Q
arXiv Detail & Related papers (2026-01-27T07:10:41Z) - RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language models (LLMs) reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model from coordination metrics (cross-validated $R^2$), enabling prediction on unseen task domains. We identify three effects, including: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Adaptive Sample-Level Framework Motivated by Distributionally Robust Optimization with Variance-Based Radius Assignment for Enhanced Neural Network Generalization Under Distribution Shift [0.8101875496469488]
Distribution shifts and minority subpopulations frequently undermine the reliability of deep neural networks trained using Empirical Risk Minimization (ERM). We propose a variance-driven, sample-level DRO framework that automatically identifies high-risk training samples and assigns a personalized robustness budget to each based on its online loss variance.
arXiv Detail & Related papers (2025-11-04T10:20:21Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS shifts hidden activations at transformer layers, guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - Improving LLM Reasoning through Interpretable Role-Playing Steering [33.25597755294326]
Role-playing has emerged as an effective technique for enhancing the reasoning capabilities of large language models (LLMs). We introduce Sparse Autoencoder Role-Playing Steering (SRPS), a novel framework that identifies and manipulates internal model features associated with role-playing behavior. Our approach extracts latent representations from role-play prompts, selects the most relevant features based on activation patterns, and constructs a steering vector that can be injected into the model's residual stream with controllable intensity.
arXiv Detail & Related papers (2025-06-09T00:31:17Z) - Fusion Steering: Prompt-Specific Activation Control [0.0]
Fusion Steering improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%.
arXiv Detail & Related papers (2025-05-28T16:46:55Z) - Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs [8.91107152198979]
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet.
arXiv Detail & Related papers (2025-03-07T12:25:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.