Steering Language Models Before They Speak: Logit-Level Interventions
- URL: http://arxiv.org/abs/2601.10960v1
- Date: Fri, 16 Jan 2026 03:00:33 GMT
- Title: Steering Language Models Before They Speak: Logit-Level Interventions
- Authors: Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han
- Abstract summary: We propose a training-free inference-time logit intervention for controllable generation. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains.
- Score: 9.055997973281919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Steering LLMs is essential for specialized applications such as style-sensitive text rewriting, user-adaptive communication, and toxicity mitigation. Current steering methods, such as prompting-based and activation-based approaches, are widely used to guide model behavior. However, activation-based techniques require deep access to internal layers, while prompting-based steering often fails to provide consistent or fine-grained control. To address these limitations, we propose a training-free inference-time logit intervention for controllable generation. Our approach utilizes a statistical token score table derived from z-normalized log-odds of labeled corpora to shift the decoding distribution. Empirical evaluations across three diverse datasets focusing on writing complexity, formality, and toxicity demonstrate that our method effectively steers output characteristics, confirming its broad applicability and task-agnostic nature. Our results show that statistically grounded logit steering can achieve large, consistent, and multi-task control gains: up to +47%p accuracy and a 50x F1 improvement.
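The mechanism the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the add-one smoothing, the unigram counting, and the `alpha` scale are all assumptions, and a real system would operate on tokenizer IDs and tensor logits.

```python
import math
from collections import Counter

def build_token_score_table(pos_corpus, neg_corpus, vocab):
    """Score each token by z-normalized log-odds of appearing in the
    positive-attribute corpus vs. the negative-attribute corpus.
    Corpora here are flat token lists; smoothing is an assumption."""
    pos_counts, neg_counts = Counter(pos_corpus), Counter(neg_corpus)
    pos_total, neg_total = len(pos_corpus), len(neg_corpus)
    log_odds = {}
    for tok in vocab:
        # Add-one smoothing so unseen tokens get finite scores.
        p = (pos_counts[tok] + 1) / (pos_total + len(vocab))
        q = (neg_counts[tok] + 1) / (neg_total + len(vocab))
        log_odds[tok] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    # z-normalize the scores across the vocabulary.
    mean = sum(log_odds.values()) / len(log_odds)
    var = sum((v - mean) ** 2 for v in log_odds.values()) / len(log_odds)
    std = math.sqrt(var) or 1.0
    return {tok: (v - mean) / std for tok, v in log_odds.items()}

def steer_logits(logits, score_table, token_ids, alpha=2.0):
    """Shift each decoding logit toward the target attribute."""
    return [logit + alpha * score_table.get(tok, 0.0)
            for logit, tok in zip(logits, token_ids)]
```

Because the score table is precomputed from labeled corpora, the intervention itself is a single vector addition per decoding step, which is what makes the approach training-free and model-agnostic.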
Related papers
- AMPS: Adaptive Modality Preference Steering via Functional Entropy [66.69992693275061]
We introduce an instance-aware diagnostic metric that quantifies each modality's information contribution and reveals sample-specific susceptibility to steering. Experimental results show that our instance-aware steering outperforms conventional steering in modulating modality preference.
arXiv Detail & Related papers (2026-02-13T02:29:06Z)
- Mechanistic Indicators of Steering Effectiveness in Large Language Models [3.635648354808971]
Activation-based steering enables Large Language Models to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood. We investigate whether the reliability of steering can be diagnosed using internal model signals.
arXiv Detail & Related papers (2026-02-02T06:56:22Z)
- RISER: Orchestrating Latent Reasoning Skills for Adaptive Activation Steering [62.63376387138257]
We propose a plug-and-play intervention framework that adaptively steers large language model (LLM) reasoning in activation space. RISER constructs a library of reusable reasoning vectors and employs a lightweight Router to dynamically compose them for each input. The Router is optimized via reinforcement learning under task-level rewards, activating latent cognitive primitives in an emergent and compositional manner.
arXiv Detail & Related papers (2026-01-14T08:04:33Z)
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
- In-Distribution Steering: Balancing Control and Coherence in Language Model Generation [0.0815557531820863]
We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.
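The IDS idea of scaling steering strength by how in-distribution a hidden state is can be sketched like this. The Mahalanobis distance, the exponential decay, and every name below are illustrative assumptions, not details from the paper:

```python
import numpy as np

def adaptive_steering_strength(h, mean, cov_inv, base_alpha=4.0, tau=3.0):
    """Scale steering strength down as the hidden state h drifts away
    from reference (in-distribution) activation statistics.
    Mahalanobis distance is an assumed choice of distance measure."""
    d = np.sqrt((h - mean) @ cov_inv @ (h - mean))
    # Full strength in-distribution, smooth decay beyond threshold tau.
    scale = 1.0 if d <= tau else np.exp(-(d - tau))
    return base_alpha * scale

def steer(h, direction, mean, cov_inv):
    """Apply a unit-norm steering direction with adaptive strength."""
    alpha = adaptive_steering_strength(h, mean, cov_inv)
    return h + alpha * direction / np.linalg.norm(direction)
```

The design intuition is that aggressive steering on out-of-distribution activations is what causes incoherent text, so strength should taper off exactly where the representation statistics no longer support the intervention.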
arXiv Detail & Related papers (2025-10-15T08:31:37Z)
- Attribution-Guided Decoding [24.52258081219335]
We introduce Attribution-Guided Decoding (AGD), an interpretability-based decoding strategy. Instead of directly manipulating model activations, AGD considers a set of high-probability output token candidates. We demonstrate AGD's efficacy across three challenging domains.
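The candidate-set re-ranking that AGD describes can be sketched as follows. `attribution_fn` is a stand-in for whichever per-candidate attribution method is used; the log-linear combination and `lam` weight are assumptions of this sketch, not AGD's actual scoring rule:

```python
import numpy as np

def attribution_guided_decode(logits, attribution_fn, k=5, lam=1.0):
    """Re-rank the top-k candidate tokens by combining model probability
    with an attribution score toward a desired concept.
    `attribution_fn(token_id) -> float` is a hypothetical placeholder."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top_k = np.argsort(probs)[-k:]  # high-probability candidates only
    scores = {t: np.log(probs[t]) + lam * attribution_fn(t) for t in top_k}
    return max(scores, key=scores.get)
```

Restricting re-ranking to the top-k candidates keeps the intervention cheap and ensures the chosen token is one the model already considered plausible.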
arXiv Detail & Related papers (2025-09-30T14:21:40Z)
- RationAnomaly: Log Anomaly Detection with Rationality via Chain-of-Thought and Reinforcement Learning [27.235259453535537]
RationAnomaly is a novel framework that enhances log anomaly detection by synergizing Chain-of-Thought fine-tuning with reinforcement learning. We have released the corresponding resources, including code and datasets.
arXiv Detail & Related papers (2025-09-18T07:35:58Z)
- Steering When Necessary: Flexible Steering Large Language Models with Backtracking [16.23081952791394]
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage. We propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention.
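The two FASB ingredients, deciding whether to intervene and backtracking when generation goes wrong, can be sketched generically. The linear probe, the sigmoid gating, and the retry loop are illustrative assumptions, not the paper's mechanism:

```python
import numpy as np

def flexible_steer_step(h, probe_w, direction, threshold=0.5, max_alpha=6.0):
    """Decide whether, and how strongly, to steer one decoding step.
    A linear probe on the hidden state estimates how close the step
    already is to the desired behavior; strength grows with the gap."""
    score = 1.0 / (1.0 + np.exp(-probe_w @ h))  # probe confidence in [0, 1]
    if score >= threshold:
        return h, False  # behavior already acceptable: no intervention
    alpha = max_alpha * (threshold - score) / threshold
    return h + alpha * direction, True

def generate_with_backtracking(step_fn, check_fn, max_retries=3):
    """Emit a token; if a post-hoc check fails, backtrack and retry the
    same position with steering forced on."""
    for attempt in range(max_retries):
        token = step_fn(force_steer=attempt > 0)
        if check_fn(token):
            return token
    return token  # give up after max_retries, keep the last attempt
```

The appeal of gating plus backtracking is that unconditional steering on every step degrades fluency, while steering only when a probe flags a problem, and rewinding when output checks fail, limits the intervention to where it is needed.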
arXiv Detail & Related papers (2025-08-25T03:01:30Z)
- The Synergy of LLMs & RL Unlocks Offline Learning of Generalizable Language-Conditioned Policies with Low-fidelity Data [50.544186914115045]
TEDUO is a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automatization tools augmenting offline datasets with richer annotations, and second, as generalizable instruction-following agents.
arXiv Detail & Related papers (2024-12-09T18:43:56Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
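The cross-model guidance pattern can be sketched generically: a steering vector obtained from a separately aligned model is applied to the target model's hidden state, gated by a harmfulness detector. The linear detector, the gating rule, and all names here are assumptions of this sketch, not InferAligner's actual procedure:

```python
import numpy as np

def cross_model_guidance(h_target, safety_vector, detector_w, beta=4.0):
    """Apply a safety steering vector (assumed to come from a separately
    aligned model) to the target model's hidden state, but only when a
    simple linear harmful-intent detector fires."""
    harmful = (detector_w @ h_target) > 0.0
    if not harmful:
        return h_target  # benign input: leave the target model untouched
    return h_target + beta * safety_vector / np.linalg.norm(safety_vector)
```

Gating on a detector is what lets this kind of method reduce attack success rate while leaving benign, in-domain behavior nearly unchanged.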
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Fine-Tuning Language Models Using Formal Methods Feedback [53.24085794087253]
We present a fully automated approach to fine-tune pre-trained language models for applications in autonomous systems.
The method synthesizes automaton-based controllers from pre-trained models guided by natural language task descriptions.
The results indicate an improvement in percentage of specifications satisfied by the controller from 60% to 90%.
arXiv Detail & Related papers (2023-10-27T16:24:24Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.