Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
- URL: http://arxiv.org/abs/2602.05234v1
- Date: Thu, 05 Feb 2026 02:51:00 GMT
- Title: Faithful Bi-Directional Model Steering via Distribution Matching and Distributed Interchange Interventions
- Authors: Yuntai Bao, Xuhong Zhang, Jintao Chen, Ge Su, Yuxiang Cai, Hao Peng, Bing Sun, Haiqin Weng, Liu Yan, Jianwei Yin
- Abstract summary: Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. We build on the principles of distributed alignment search to propose a new steering method: Concept DAS. We show that Concept DAS does not always outperform preference-optimization methods but may benefit more from increased model scale.
- Score: 37.08071497197165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, by adapting strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored for the steering task by aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways: first, it learns interventions via weakly supervised distribution matching rather than probability maximization; second, it uses DIIs that naturally enable bi-directional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and resulting in more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and conditionally constitutes a robust approach to intervention-based model steering. Our code is available at https://github.com/colored-dye/concept_das.
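The two ingredients named in the abstract, a distributed interchange intervention and a distribution matching objective, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a one-dimensional subspace, a linear readout standing in for the rest of the model, and KL divergence as the matching loss (the abstract does not state which divergence is used). All function names here (`interchange_1d`, `matching_loss`, and so on) are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def interchange_1d(h_base, h_source, direction):
    """Toy interchange intervention on a 1-D subspace (an assumption).

    The component of h_base along `direction` is replaced by the
    component of h_source along it; everything orthogonal is kept.
    """
    v = normalize(direction)
    delta = dot(h_source, v) - dot(h_base, v)
    return [hb + delta * vi for hb, vi in zip(h_base, v)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); the choice of KL is an assumption of this sketch."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def readout(h, W):
    """Hypothetical linear readout standing in for the downstream model."""
    return [dot(row, h) for row in W]


def matching_loss(h_base, h_source, direction, W, counterfactual_logits):
    """Compare the intervened output distribution with a counterfactual one."""
    h_new = interchange_1d(h_base, h_source, direction)
    p_intervened = softmax(readout(h_new, W))
    p_counterfactual = softmax(counterfactual_logits)
    return kl_divergence(p_counterfactual, p_intervened)


if __name__ == "__main__":
    W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    h_base, h_source = [1.0, 2.0, 3.0], [5.0, 0.0, 0.0]
    direction = [2.0, 0.0, 0.0]
    print(interchange_1d(h_base, h_source, direction))
    print(matching_loss(h_base, h_source, direction, W, [0.0, 0.0, 10.0]))
```

In this toy setting, swapping the `h_base` and `h_source` arguments steers in the opposite direction, which is one way to read the abstract's remark that DIIs enable bi-directional steering. The size of the change comes from the two hidden states themselves rather than from a hand-tuned scaling factor.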
Related papers
- Weight Updates as Activation Shifts: A Principled Framework for Steering [54.70188910511715]
Activation steering promises to be an extremely parameter-efficient form of adaptation, but its effectiveness depends on critical design choices. We establish a first-order equivalence between activation-space interventions and weight-space updates, deriving the conditions under which activation steering can replicate fine-tuning behavior. This equivalence yields a principled framework for steering design and identifies the post-block output as a theoretically-backed and highly expressive intervention site.
arXiv Detail & Related papers (2026-02-28T02:50:04Z)
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
- Inference-time Alignment via Sparse Junction Steering [25.464612964225484]
Token-level steering has emerged as a pivotal approach for inference-time alignment. Existing methods rely on dense intervention at every decoding step. We show that dense intervention is unnecessary and propose sparse junction steering.
arXiv Detail & Related papers (2026-01-30T08:40:47Z)
- Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection [1.7802147489386628]
Large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. We propose Selective Steering, which addresses these limitations through two key innovations. Experiments across nine models demonstrate that Selective Steering achieves 5.5x higher attack success rates than prior methods.
arXiv Detail & Related papers (2026-01-27T08:56:25Z)
- Model-Based Diffusion Sampling for Predictive Control in Offline Decision Making [48.998030470623384]
Offline decision-making requires reliable behaviors from fixed datasets without further interaction. We propose a compositional model-based diffusion framework consisting of: (i) a planner that generates diverse, task-aligned trajectories; (ii) a dynamics model that enforces consistency with the underlying system dynamics; and (iii) a ranker module that selects behaviors aligned with the task objectives.
arXiv Detail & Related papers (2025-12-09T06:26:02Z)
- DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse. We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z)
- Dynamically Scaled Activation Steering [3.177576903071419]
We introduce Dynamically Scaled Activation Steering (DSAS), a method-agnostic steering framework that decouples when to steer from how to steer. DSAS adaptively modulates the strength of existing steering transformations across layers and inputs, intervening strongly only when undesired behavior is detected.
arXiv Detail & Related papers (2025-12-03T10:50:15Z)
- Preference-Based Alignment of Discrete Diffusion Models [14.874943508610857]
We introduce Discrete Diffusion DPO (D2-DPO), the first adaptation of Direct Preference Optimization (DPO) to discrete diffusion models formulated as continuous-time Markov chains. Our approach derives a novel loss function that directly fine-tunes the generative process using preference data while preserving fidelity to a reference distribution. Our results highlight that D2-DPO enables controlled fine-tuning without requiring explicit reward models, making it a practical alternative to reinforcement learning-based approaches.
arXiv Detail & Related papers (2025-03-11T11:07:35Z)
- Controllable Motion Generation via Diffusion Modal Coupling [19.534234002173314]
We propose a novel framework that enhances controllability in diffusion models by leveraging multi-modal prior distributions. We evaluate our approach on motion prediction using a dataset and multi-task control in Maze2D environments.
arXiv Detail & Related papers (2025-03-04T07:22:34Z)
- Domain-Specific Risk Minimization for Out-of-Distribution Generalization [104.17683265084757]
We first establish a generalization bound that explicitly considers the adaptivity gap.
We propose effective gap estimation methods for guiding the selection of a better hypothesis for the target.
The other method is minimizing the gap directly by adapting model parameters using online target samples.
arXiv Detail & Related papers (2022-08-18T06:42:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.