Related papers: Multi-Attribute Steering of Language Models via Targeted Intervention

URL: http://arxiv.org/abs/2502.12446v2
Date: Wed, 09 Jul 2025 17:31:20 GMT
Title: Multi-Attribute Steering of Language Models via Targeted Intervention
Authors: Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Abstract summary: Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction. We introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes.
Score: 56.93583799109029
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
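The abstract describes three ingredients: token-level selective intervention, sparsity, and orthogonality among per-attribute steering vectors. The following is a minimal NumPy sketch of how such pieces could fit together. It is not the authors' implementation: the sigmoid gating form and the names `steer`, `orthogonality_penalty`, `sparsity_penalty`, `gate_w`, and `gate_b` are assumptions made purely for illustration.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def steer(hidden, vectors, gate_w, gate_b):
    """Add a gated mix of attribute vectors to each token representation.

    hidden:  (T, d) token representations
    vectors: (K, d) one steering vector per attribute
    gate_w:  (K, d) hypothetical per-attribute gate weights
    gate_b:  (K,)   hypothetical per-attribute gate biases
    Returns the steered representations (T, d) and the gates (T, K).
    """
    # Assumed gating form: one scalar in (0, 1) per token and attribute.
    gates = sigmoid(hidden @ gate_w.T + gate_b)
    return hidden + gates @ vectors, gates


def orthogonality_penalty(vectors):
    """Sum of squared cosine similarities between distinct steering vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.sum(off_diag ** 2))


def sparsity_penalty(gates):
    """Mean absolute gate value; smaller means fewer tokens are steered."""
    return float(np.mean(np.abs(gates)))
```

In a setup like this, the two penalties would be added to whatever alignment loss is used to train the vectors and gates. The orthogonality term is zero when the attribute vectors are mutually perpendicular, so steering one attribute does not move representations along another attribute's direction.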

Related papers

Unified modality separation: A vision-language framework for unsupervised domain adaptation [60.8391821117794]
Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. We propose a unified modality separation framework that accommodates both modality-specific and modality-invariant components. Our methods achieve up to a 9% performance gain with 9 times the computational efficiency.
arXiv Detail & Related papers (2025-08-07T02:51:10Z)
Stochastic Encodings for Active Feature Acquisition [100.47043816019888]
Active Feature Acquisition is an instance-wise, sequential decision-making problem. The aim is to dynamically select which feature to measure based on current observations, independently for each test instance. Common approaches either use Reinforcement Learning, which experiences training difficulties, or greedily maximize the conditional mutual information of the label and unobserved features, which makes myopic acquisitions. We introduce a latent variable model, trained in a supervised manner. Acquisitions are made by reasoning about the features across many possible unobserved realizations in a latent space.
arXiv Detail & Related papers (2025-08-03T23:48:46Z)
Test-Time Consistency in Vision Language Models [26.475993408532304]
Vision-Language Models (VLMs) have achieved impressive performance across a wide range of multimodal tasks. Recent benchmarks, such as MM-R3, highlight that even state-of-the-art VLMs can produce divergent predictions across semantically equivalent inputs. We propose a simple and effective test-time consistency framework that enhances semantic consistency without supervised re-training.
arXiv Detail & Related papers (2025-06-27T17:09:44Z)
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models [1.6874375111244329]
We introduce K-Steering, a unified and flexible approach that trains a single non-linear multi-label classifier on hidden activations. This avoids linearity assumptions, removes the need for storing and tuning separate attribute vectors, and allows dynamic composition of behaviors without retraining. Empirical results across 3 model families, validated by both activation-based classifiers and LLM-based judges, demonstrate that K-Steering outperforms strong baselines in accurately steering multiple behaviors.
arXiv Detail & Related papers (2025-05-30T12:41:19Z)
AdaRank: Adaptive Rank Pruning for Enhanced Model Merging [15.383220675351076]
Model merging has emerged as a promising approach for unifying independently fine-tuned models into an integrated framework. We propose AdaRank, a novel model merging framework that adaptively selects the most beneficial singular directions of task vectors to merge multiple models. AdaRank consistently achieves state-of-the-art performance with various backbones and number of tasks, reducing the performance gap between fine-tuned models to nearly 1%.
arXiv Detail & Related papers (2025-03-28T06:49:06Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Multi-Attribute Constraint Satisfaction via Language Model Rewriting [67.5778646504987]
Multi-Attribute Constraint Satisfaction (MACS) is a method capable of finetuning language models to satisfy user-specified constraints on multiple external real-value attributes. Our work opens new avenues for generalized and real-value multi-attribute control, with implications for diverse applications spanning NLP and bioinformatics.
arXiv Detail & Related papers (2024-12-26T12:36:39Z)
Dynamic Adaptive Optimization for Effective Sentiment Analysis Fine-Tuning on Large Language Models [0.0]
Large language models (LLMs) have become a popular paradigm for sentiment analysis, leveraging multi-task learning to address specific tasks concurrently. We propose a novel multi-task learning framework with a dynamic adaptive optimization (DAO) module. This work improves the Mean Squared Error (MSE) and Accuracy (ACC) by 15.58% and 1.24% respectively, compared with previous work.
arXiv Detail & Related papers (2024-08-15T19:13:38Z)
Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments [13.163784646113214]
Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to continually changing target domains. We present AMROD, featuring three core components. Firstly, the object-level contrastive learning module extracts object-level features for contrastive learning to refine the feature representation in the target domain. Secondly, the adaptive monitoring module dynamically skips unnecessary adaptation and updates the category-specific threshold based on predicted confidence scores to enable efficiency and improve the quality of pseudo-labels.
arXiv Detail & Related papers (2024-06-24T08:30:03Z)
Interpetable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance. We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features. In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
arXiv Detail & Related papers (2024-06-12T08:30:16Z)
InterroGate: Learning to Share, Specialize, and Prune Representations for Multi-task Learning [17.66308231838553]
We propose a novel multi-task learning (MTL) architecture designed to mitigate task interference while optimizing inference computational efficiency. We employ a learnable gating mechanism to automatically balance the shared and task-specific representations while preserving the performance of all tasks.
arXiv Detail & Related papers (2024-02-26T18:59:52Z)
Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data. For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z)
Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMs [65.42104819071444]
Multitask learning (MTL) leverages task-relatedness to enhance performance. We employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices. We propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least square support vector machines (LSSVMs).
arXiv Detail & Related papers (2023-08-30T14:28:26Z)
Hierarchical Disentanglement-Alignment Network for Robust SAR Vehicle Recognition [18.38295403066007]
HDANet integrates feature disentanglement and alignment into a unified framework. The proposed method demonstrates impressive robustness across nine operating conditions in the MSTAR dataset.
arXiv Detail & Related papers (2023-04-07T09:11:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.