Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
- URL: http://arxiv.org/abs/2503.05371v1
- Date: Fri, 07 Mar 2025 12:25:29 GMT
- Title: Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
- Authors: Zara Siddique, Irtaza Khalid, Liam D. Turner, Luis Espinosa-Anke
- Abstract summary: We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes. Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender.
- Score: 8.91107152198979
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen, respectively. Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender. By leveraging their collective strength, SVE outperforms individual steering vectors in both bias reduction and maintaining model performance. The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that SVE is a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
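Below is a minimal PyTorch sketch of the core idea described in the abstract, assuming a Hugging Face causal LM: compute one steering vector per bias axis as the mean activation difference over contrastive prompt pairs, average the per-axis vectors to form the ensemble, and add the scaled result to a decoder layer's output during generation. The model name, layer index, scaling coefficient, and toy prompt pairs are illustrative assumptions, not the paper's configuration (which selects contrastive pair datasets and hyperparameters via Bayesian optimization on BBQ).

```python
# Sketch of Steering Vector Ensembles (SVE): one steering vector per bias axis
# from contrastive prompt pairs, averaged and added to a hidden layer's output.
# Model name, layer index, scale, and prompts are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any causal LM works
LAYER = 14   # hypothetical intervention layer
SCALE = 4.0  # hypothetical steering coefficient

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"  # needs `accelerate`
)
model.eval()

@torch.no_grad()
def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final token after decoder layer LAYER."""
    ids = tok(prompt, return_tensors="pt").to(model.device)
    out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is layer LAYER's output.
    return out.hidden_states[LAYER + 1][0, -1, :]

@torch.no_grad()
def steering_vector(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """Mean activation difference between the two sides of each contrastive pair."""
    diffs = [last_token_hidden(pos) - last_token_hidden(neg) for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)

# One contrastive-pair dataset per bias axis (toy examples, not the BBQ-derived data).
axis_pairs = {
    "age":    [("Older workers are capable.", "Older workers are incapable.")],
    "gender": [("She is a skilled engineer.", "She is not a skilled engineer.")],
    "race":   [("They are trustworthy.", "They are untrustworthy.")],
}

# SVE: average the individually computed steering vectors across axes.
sve = torch.stack([steering_vector(p) for p in axis_pairs.values()]).mean(dim=0)

def steer_hook(_module, _inputs, output):
    """Add the scaled ensemble vector to the layer's residual-stream output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * sve.to(device=hidden.device, dtype=hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer_hook)
try:
    ids = tok("The elderly applicant was", return_tensors="pt").to(model.device)
    out_ids = model.generate(**ids, max_new_tokens=30)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # restore the unsteered model
```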
Related papers
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance.
We propose Iterative Agent Decoding (IAD) which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z) - Investigating Generalization of One-shot LLM Steering Vectors [21.2431937128876]
We propose optimizing steering vectors through gradient descent on a single training example. We find that the resulting vectors effectively mediate safety-relevant behaviors in multiple models.
arXiv Detail & Related papers (2025-02-26T06:13:01Z) - Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering [7.471995248769638]
We propose a training-free debiasing framework for large Multi-Modal Models (LMMs).
Our framework intervenes on the model's representations during text generation by constructing a steering vector that reduces reference to protected attributes.
Our experiments show that these interventions effectively reduce the propensity of LMMs to generate text related to protected attributes while maintaining sentiment and fluency.
arXiv Detail & Related papers (2024-11-15T20:06:09Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment [57.03947082589616]
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets.
We study this and find that preference data gives a better learning signal when the underlying responses are contrastive.
We introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs.
Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%.
arXiv Detail & Related papers (2024-08-12T16:24:51Z) - A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques [63.10251271444959]
Large language models are first pre-trained on trillions of tokens and then instruction-tuned or aligned to specific preferences.
We conduct an in-depth investigation of the impact of popular choices for three crucial axes.
Our setup spanning over 300 experiments reveals consistent trends and unexpected findings.
arXiv Detail & Related papers (2024-06-07T12:25:51Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization [34.05163996072159]
"steering vectors" are extracted from the activations of human preference data.
This work proposes an innovative approach that could produce more effective steering vectors through bi-directional preference optimization.
Our method is designed to allow steering vectors to directly influence the generation probability of contrastive human preference data pairs.
arXiv Detail & Related papers (2024-05-28T05:10:40Z) - Empirical Study on Optimizer Selection for Out-of-Distribution Generalization [16.386766049451317]
Modern deep learning systems do not generalize well when the test data distribution is slightly different from the training data distribution.
In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift.
arXiv Detail & Related papers (2022-11-15T23:56:30Z) - Debiasing Learning for Membership Inference Attacks Against Recommender Systems [79.48353547307887]
Learned recommender systems may inadvertently leak information about their training data, leading to privacy violations.
We investigate privacy threats faced by recommender systems through the lens of membership inference.
We propose a Debiasing Learning for Membership Inference Attacks against recommender systems (DL-MIA) framework that has four main components.
arXiv Detail & Related papers (2022-06-24T17:57:34Z) - Control-Aware Prediction Objectives for Autonomous Driving [78.19515972466063]
We present control-aware prediction objectives (CAPOs) to evaluate the downstream effect of predictions on control without requiring the planner to be differentiable.
We propose two types of importance weights that weight the predictive likelihood: one using an attention model between agents, and another based on control variation when exchanging predicted trajectories for ground truth trajectories.
arXiv Detail & Related papers (2022-04-28T07:37:21Z)