In-Distribution Steering: Balancing Control and Coherence in Language Model Generation
- URL: http://arxiv.org/abs/2510.13285v1
- Date: Wed, 15 Oct 2025 08:31:37 GMT
- Title: In-Distribution Steering: Balancing Control and Coherence in Language Model Generation
- Authors: Arthur Vogels, Benjamin Wong, Yann Choho, Annabelle Blangero, Milan Bhan
- Abstract summary: We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making it particularly well suited for real-world applications.
- Score: 0.0815557531820863
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, which leads either to insufficient control or to overly strong, unadapted interventions that degrade text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts interventions according to how far a given input lies within the distribution, enabling adaptive intervention and generation stability during text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.
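The abstract states the idea (scale the intervention by how far the input lies within the data distribution in representation space) but not the algorithm. As a hedged illustration only, the sketch below scales a steering vector back until the steered activation stays within a Mahalanobis-distance threshold of a reference distribution; the distance measure, backoff schedule, and threshold are all assumptions for illustration, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference activations standing in for hidden states
# collected from in-distribution inputs (not the paper's data).
ref = rng.normal(size=(500, 8))
mu = ref.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(ref, rowvar=False))

def mahalanobis(h):
    """Distance of activation h from the reference distribution."""
    d = h - mu
    return float(np.sqrt(d @ cov_inv @ d))

def steer_in_distribution(h, v, base_strength=2.0, max_dist=4.0,
                          backoff=0.5, steps=20):
    """Apply steering vector v to activation h, reducing the strength
    until the steered activation stays within the distribution."""
    scale = base_strength
    for _ in range(steps):
        steered = h + scale * v
        if mahalanobis(steered) <= max_dist:
            return steered
        scale *= backoff
    return h  # fall back to no intervention

h = rng.normal(size=8)   # an activation to steer
v = rng.normal(size=8)   # a concept/steering direction
h_steered = steer_in_distribution(h, v)
```

The key design point this sketch captures is that the intervention strength is not fixed: it shrinks automatically whenever the full-strength update would push the activation out of distribution, which is the failure mode the abstract attributes to fixed-strength steering.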
Related papers
- AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z) - Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are typically studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z) - ILRR: Inference-Time Steering Method for Masked Diffusion Language Models [8.458339111154585]
We introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. We further introduce Spatially Modulated Steering, an extension that enables steering long texts with shorter references by regulating guidance intensity across the sequence.
arXiv Detail & Related papers (2026-01-29T12:48:59Z) - STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model [71.35577462669856]
We propose a robust, provably secure linguistic steganography scheme based on diffusion language models (DLMs). We introduce error correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction.
arXiv Detail & Related papers (2026-01-21T08:58:12Z) - Steering Language Models Before They Speak: Logit-Level Interventions [9.055997973281919]
We propose a training-free inference-time logit intervention for controllable generation. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains.
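This entry intervenes on logits rather than activations. As a minimal hedged sketch of the general idea (the toy vocabulary, token ids, and bias value are made up, and the paper's statistical grounding of the bias is not reproduced here), one can add a bias to the logits of attribute-associated tokens before the softmax:

```python
import numpy as np

def softmax(x):
    z = x - x.max()      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def steer_logits(logits, token_ids, bias):
    """Training-free logit intervention: bias selected tokens up
    (positive bias) or down (negative bias) before sampling."""
    out = logits.copy()
    out[token_ids] += bias
    return out

logits = np.array([2.0, 1.0, 0.0, -1.0])     # toy 4-token vocabulary
p_before = softmax(logits)
p_after = softmax(steer_logits(logits, [2], 3.0))
```

Because the bias is applied before normalization, probability mass shifts toward the boosted tokens while the output remains a valid distribution, and the base model's weights are never touched.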
arXiv Detail & Related papers (2026-01-16T03:00:33Z) - Prototype-Based Dynamic Steering for Large Language Models [3.90727941420584]
Prototype-Based Dynamic Steering (PDS) is a test-time method that amplifies large language model (LLM) reasoning without adding or altering instructions. We introduce "reasoning prototypes" by clustering activation differences between Chain-of-Thought (CoT) and neutral prompts. PDS consistently improves accuracy without fine-tuning or prompt engineering.
arXiv Detail & Related papers (2025-10-07T01:34:28Z) - Activation Steering with a Feedback Controller [4.609594868699996]
Proportional-Integral-Derivative (PID) Steering is a principled framework that leverages the full PID controller for activation steering in large language models. PID Steering consistently outperforms existing approaches, achieving more robust and reliable behavioral control.
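The blurb names the full PID controller but gives no equations. Below is a textbook discrete PID loop, with a scalar first-order system standing in for the measured behavioral signal (e.g. the projection of an activation onto a concept direction); the gains, target, and toy plant are illustrative assumptions, not values or details from the paper.

```python
class PIDController:
    """Standard discrete PID controller (illustrative, not the paper's)."""

    def __init__(self, kp, ki, kd, target):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.target = target
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement):
        error = self.target - measurement
        self.integral += error                      # I term accumulates error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy loop: drive a scalar "behavior score" toward a target of 1.0,
# standing in for adjusting steering strength token by token.
pid = PIDController(kp=1.0, ki=0.05, kd=0.0, target=1.0)
score = 0.0
for _ in range(200):
    strength = pid.update(score)
    score += 0.1 * strength   # toy plant: score responds to the intervention
```

The appeal of a feedback formulation for steering is visible even in this toy: the intervention strength is recomputed at every step from the observed error, rather than fixed in advance, so the control signal decays as the behavior approaches the target.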
arXiv Detail & Related papers (2025-10-05T18:05:28Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models on new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS modifies hidden activations at transformer layers guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
arXiv Detail & Related papers (2025-07-24T02:34:13Z) - LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [94.84458417662404]
LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors. LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
arXiv Detail & Related papers (2025-04-15T17:14:06Z) - LF-Steering: Latent Feature Activation Steering for Enhancing Semantic Consistency in Large Language Models [16.37602070339033]
Large Language Models (LLMs) often generate inconsistent responses when prompted with semantically equivalent paraphrased inputs. We propose LF-Steering, a novel activation steering approach that precisely identifies the latent feature representations responsible for semantic inconsistency. Our method maps the hidden states of the relevant transformer layer into a sparsely activated, high-dimensional feature space based on a sparse autoencoder.
arXiv Detail & Related papers (2025-01-19T13:06:51Z) - Diffusion Predictive Control with Constraints [51.91057765703533]
Diffusion predictive control with constraints (DPCC) is an algorithm for diffusion-based control with explicit state and action constraints. We show through simulations of a robot manipulator that DPCC outperforms existing methods in satisfying novel test-time constraints.
arXiv Detail & Related papers (2024-12-12T15:10:22Z) - Linearly Controlled Language Generation with Performative Guarantees [4.447467536572626]
We use a common model of concept semantics as linearly represented in an LM's latent space. We take the view that natural language generation traces a trajectory in this continuous semantic space. We propose a lightweight, gradient-free intervention that dynamically steers trajectories away from regions corresponding to undesired meanings.
arXiv Detail & Related papers (2024-05-24T11:30:44Z) - InferAligner: Inference-Time Alignment for Harmlessness through
Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.