Learning Distribution-Wise Control in Representation Space for Language Models
- URL: http://arxiv.org/abs/2506.06686v1
- Date: Sat, 07 Jun 2025 06:52:58 GMT
- Title: Learning Distribution-Wise Control in Representation Space for Language Models
- Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
- Abstract summary: Learnable interventions aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. We extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace.
- Score: 7.756342860929851
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: https://github.com/chili-lab/D-Intervention
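A minimal sketch of the contrast between the two intervention styles, assuming a ReFT-style low-rank edit; the class names, the unconstrained `R` (LoReFT uses an orthonormal basis), and the Gaussian parameterization are illustrative, not the released implementation:

```python
import torch
import torch.nn as nn

class PointwiseIntervention(nn.Module):
    """ReFT-style pointwise edit: h' = h + R^T (proj(h) - R h)."""
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # Subspace basis R (kept unconstrained here for brevity) and the
        # learned pointwise target inside that subspace.
        self.R = nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)
        self.proj = nn.Linear(hidden_dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + (self.proj(h) - h @ self.R.T) @ self.R

class DistributionWiseIntervention(PointwiseIntervention):
    """Distribution-wise variant (illustrative): learn a Gaussian region
    around the pointwise target in the concept subspace and sample from it
    with the reparameterization trick."""
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__(hidden_dim, rank)
        self.log_sigma = nn.Parameter(torch.zeros(rank))  # learned std per subspace dim

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        target = self.proj(h)
        if self.training:
            # Sample around the pointwise target; a larger sigma covers more
            # of the surrounding region of the concept subspace.
            target = target + torch.randn_like(target) * self.log_sigma.exp()
        return h + (target - h @ self.R.T) @ self.R
```

Under this reading, the pointwise method is recovered as the special case where sigma shrinks to zero.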
Related papers
- GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs [56.93583799109029]
GrAInS is an inference-time steering approach that operates across both language-only and vision-language models and tasks. During inference, GrAInS adjusts hidden activations at transformer layers, guided by token-level attribution signals, and normalizes activations to preserve representational scale. It consistently outperforms both fine-tuning and existing steering baselines.
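A hedged sketch of the described mechanism; the attribution scores and steering direction are taken as given (in the paper they come from gradient-based attribution), and the function name is illustrative:

```python
import torch

def steer_hidden_states(h: torch.Tensor, attribution: torch.Tensor,
                        direction: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    # h: (seq_len, hidden); attribution: (seq_len,); direction: (hidden,)
    orig_norm = h.norm(dim=-1, keepdim=True)               # per-token scale before steering
    h = h + alpha * attribution.unsqueeze(-1) * direction  # attribution-weighted shift
    new_norm = h.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return h * (orig_norm / new_norm)                      # restore norms to preserve scale
```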
arXiv Detail & Related papers (2025-07-24T02:34:13Z)
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- HSI: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models [2.6703221234079946]
We show that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination for Llama 2. Our method applies fine-grained interventions at specific model sub-components, particularly attention heads, using a simple binary choice probing strategy. We show that probing single attention heads is more effective than intervening on full layers, and that intervening on only four attention heads is comparable to supervised fine-tuning.
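A minimal sketch of a head-specific intervention, assuming the usual concatenated-heads layout before the output projection; selecting `head_idx` and `direction` via the binary-choice probing is left out:

```python
import torch

def head_specific_intervention(attn_out: torch.Tensor, head_idx: int,
                               direction: torch.Tensor, strength: float,
                               num_heads: int) -> torch.Tensor:
    # attn_out: (seq_len, hidden) = concatenated per-head outputs (assumed layout).
    head_dim = attn_out.shape[-1] // num_heads
    out = attn_out.clone()
    start = head_idx * head_dim
    out[..., start:start + head_dim] += strength * direction  # direction: (head_dim,)
    return out
```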
arXiv Detail & Related papers (2025-02-09T16:11:57Z)
- Diffusion Predictive Control with Constraints [51.91057765703533]
Diffusion predictive control with constraints (DPCC) is an algorithm for diffusion-based control with explicit state and action constraints. We show through simulations of a robot manipulator that DPCC outperforms existing methods in satisfying novel test-time constraints.
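One way to picture the constrained denoising loop, as a sketch: a learned reverse step followed by projection onto the constraint set, so intermediate trajectories stay feasible. The `denoise` callable and the box constraints are stand-ins for the paper's general state and action constraints:

```python
import torch

def constrained_denoise_step(denoise, traj_t: torch.Tensor, t: int,
                             lo: float, hi: float) -> torch.Tensor:
    traj_prev = denoise(traj_t, t)   # one reverse-diffusion step (assumed interface)
    return traj_prev.clamp(lo, hi)   # projection onto the [lo, hi] box constraint set
```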
arXiv Detail & Related papers (2024-12-12T15:10:22Z)
- Refusal in LLMs is an Affine Function [1.722461331472526]
We propose affine concept editing (ACE) as an approach for steering language models' behavior. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods.
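A hedged sketch of an affine concept edit in the stated spirit: project out the representation's coordinate along a refusal direction, then add the direction back at a chosen level. The names and the scalar `target_level` are illustrative:

```python
import torch

def affine_concept_edit(h: torch.Tensor, refusal_dir: torch.Tensor,
                        target_level: float = 0.0) -> torch.Tensor:
    # h: (batch, hidden); refusal_dir: (hidden,)
    d = refusal_dir / refusal_dir.norm()   # unit concept direction
    coeff = h @ d                          # current coordinate along that direction
    # Subspace projection plus activation addition, combined in one affine map.
    return h + (target_level - coeff).unsqueeze(-1) * d
```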
arXiv Detail & Related papers (2024-11-13T20:12:55Z)
- Towards Unifying Interpretability and Control: Evaluation via Intervention [25.4582941170387]
We argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. We extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework. We introduce two new evaluation metrics: intervention success rate and coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior.
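A sketch of how the first metric might be computed; the `is_success` judge (a classifier, keyword check, or human label) is an assumed stand-in for the paper's concept-specific criterion:

```python
def intervention_success_rate(generations, is_success) -> float:
    """Fraction of intervened generations judged to express the target behavior."""
    if not generations:
        return 0.0
    return sum(map(is_success, generations)) / len(generations)
```

The coherence-intervention tradeoff would pair this rate with a fluency score on the same generations.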
arXiv Detail & Related papers (2024-11-07T04:52:18Z)
- Generalize or Detect? Towards Robust Semantic Segmentation Under Multiple Distribution Shifts [56.57141696245328]
In open-world scenarios, where both novel classes and domains may exist, an ideal segmentation model should detect anomaly classes for safety.
Existing methods often struggle to distinguish between domain-level and semantic-level distribution shifts.
arXiv Detail & Related papers (2024-11-06T11:03:02Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations. This paper investigates the formal and empirical properties of steering functions.
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- Learning a Diffusion Model Policy from Rewards via Q-Score Matching [93.0191910132874]
We present a theoretical framework linking the structure of diffusion model policies to a learned Q-function. We propose a new policy update method from this theory, which we denote Q-score matching.
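The core idea can be written as aligning the policy's score with the action-gradient of the learned Q-function; this is a schematic form, not the paper's exact loss:

```latex
\mathcal{L}_{\mathrm{QSM}}(\theta)
  = \mathbb{E}_{s,a}\!\left[\,
      \big\lVert \nabla_a \log \pi_\theta(a \mid s)
                 - \beta \, \nabla_a Q_\phi(s, a) \big\rVert^2
    \right]
```

where beta scales how strongly the policy score tracks the Q-gradient.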
arXiv Detail & Related papers (2023-12-18T23:31:01Z)
- Variance-Preserving-Based Interpolation Diffusion Models for Speech Enhancement [53.2171981279647]
We present a framework that encapsulates both variance-preserving (VP)- and variance-exploding (VE)-based diffusion methods (the standard VP forward SDE is recalled below).
To improve performance and ease model training, we analyze the common difficulties encountered in diffusion models.
We evaluate our model against several methods using a public benchmark to showcase the effectiveness of our approach.
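For reference, the standard variance-preserving forward SDE that VP-based methods build on (textbook form, not specific to this paper's interpolation scheme):

```latex
\mathrm{d}x_t = -\tfrac{1}{2}\,\beta(t)\,x_t\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w_t
```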
arXiv Detail & Related papers (2023-06-14T14:22:22Z)
- Manifold-Aware Self-Training for Unsupervised Domain Adaptation on Regressing 6D Object Pose [69.14556386954325]
This paper bridges the domain gap between synthetic and real data in visual regression via global feature alignment and local refinement.
Our method incorporates an explicit self-supervised manifold regularization, revealing consistent cumulative target dependency across domains.
Unified implicit neural functions are learned to estimate the relative direction and distance of targets to their nearest class bins, refining target classification predictions.
arXiv Detail & Related papers (2023-05-18T08:42:41Z)
- On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting [5.5302127686575435]
Two main paradigms have emerged to tackle the challenge of fine-tuning language models without catastrophic forgetting: Reward Maximization (RM) and, more recently, Distribution Matching (DM).
We show that methods such as KL-control developed for RM can also be construed as belonging to DM.
We leverage connections between the two paradigms to import the concept of baseline into DM methods.
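Schematically, the imported baseline plays its usual variance-reduction role in the gradient estimate; the paper's DM objectives differ in the details:

```latex
\nabla_\theta J(\theta) \approx
  \mathbb{E}_{x \sim \pi_\theta}\!\left[ \big(R(x) - b\big)\,
    \nabla_\theta \log \pi_\theta(x) \right]
```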
arXiv Detail & Related papers (2022-06-01T20:54:41Z)
- Modeling Human Driver Interactions Using an Infinite Policy Space Through Gaussian Processes [0.0]
This paper proposes a method for modeling human driver interactions that relies on multi-output Gaussian processes.
The proposed method is validated on a real traffic dataset to demonstrate its contributions and implications.
arXiv Detail & Related papers (2022-01-03T17:45:58Z)