DSO: Direct Steering Optimization for Bias Mitigation
- URL: http://arxiv.org/abs/2512.15926v2
- Date: Sat, 20 Dec 2025 01:38:32 GMT
- Title: DSO: Direct Steering Optimization for Bias Mitigation
- Authors: Lucas Monteiro Paes, Nivedha Sivakumar, Yinong Oliver Wang, Masha Fedzechkina Donaldson, Barry-John Theobald, Luca Zappella, Nicholas Apostoloff
- Abstract summary: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. We propose Direct Steering Optimization (DSO), which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance.
- Score: 12.033608044339717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative models are often deployed to make decisions on behalf of users, such as vision-language models (VLMs) identifying which person in a room is a doctor to help visually impaired individuals. Yet, VLM decisions are influenced by the perceived demographic attributes of people in the input, which can lead to biased outcomes like failing to identify women as doctors. Moreover, when reducing bias leads to performance loss, users may have varying needs for balancing bias mitigation with overall model capabilities, highlighting the demand for methods that enable controllable bias reduction during inference. Activation steering is a popular approach for inference-time controllability that has shown potential in inducing safer behavior in large language models (LLMs). However, we observe that current steering methods struggle to correct biases, where equiprobable outcomes across demographic groups are required. To address this, we propose Direct Steering Optimization (DSO) which uses reinforcement learning to find linear transformations for steering activations, tailored to mitigate bias while maintaining control over model performance. We demonstrate that DSO achieves state-of-the-art trade-off between fairness and capabilities on both VLMs and LLMs, while offering practitioners inference-time control over the trade-off. Overall, our work highlights the benefit of designing steering strategies that are directly optimized to control model behavior, providing more effective bias intervention than methods that rely on pre-defined heuristics for controllability.
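For intuition, the sketch below applies a linear transformation to one transformer layer's hidden states at inference time via a forward hook, which is the general shape of intervention DSO optimizes. Everything specific here is an assumption: the GPT-2 backbone, the layer index, and the placeholder parameters `W` and `b` (which DSO would instead learn with reinforcement learning); the scalar `strength` stands in for the paper's inference-time control over the fairness-capability trade-off.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical backbone; any decoder-only LM with accessible blocks works.
model = AutoModelForCausalLM.from_pretrained("gpt2")
tok = AutoTokenizer.from_pretrained("gpt2")
d = model.config.hidden_size

# Placeholder steering parameters: identity/zero is a no-op until learned.
# DSO would find these via reinforcement learning; here they are stand-ins.
W = torch.eye(d)
b = torch.zeros(d)
strength = 0.5  # inference-time knob: 0 = original model, 1 = fully steered

def steer(module, inputs, output):
    h = output[0] if isinstance(output, tuple) else output
    steered = h @ W.T + b                            # linear steering of activations
    mixed = (1 - strength) * h + strength * steered  # controllable interpolation
    return (mixed,) + output[1:] if isinstance(output, tuple) else mixed

# Hook a mid-depth block; the choice of layer 6 is illustrative only.
handle = model.transformer.h[6].register_forward_hook(steer)
ids = tok("The person most likely to be the doctor is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=10)[0]))
handle.remove()
```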
Related papers
- PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding [85.22047087898311]
We introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses to reinforce desirable outcomes. Experiments on the "3H" alignment objectives demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time.
arXiv Detail & Related papers (2026-02-24T08:56:52Z)
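The abstract reads like standard contrastive decoding applied at the prompt level; a minimal sketch of that idea is below. The prompt strings, the `alpha` contrast weight, and the single-step view are assumptions, not PromptCD's actual construction.

```python
import torch

def contrastive_next_token_logits(model, tok, query,
                                  pos_prompt="Be honest and helpful. ",
                                  neg_prompt="Be evasive and unhelpful. ",
                                  alpha=0.5):
    # Next-token logits under a positive vs. negative guiding prompt
    # (prompts and alpha are illustrative placeholders).
    pos = tok(pos_prompt + query, return_tensors="pt")
    neg = tok(neg_prompt + query, return_tensors="pt")
    with torch.no_grad():
        logits_pos = model(**pos).logits[0, -1]
        logits_neg = model(**neg).logits[0, -1]
    # Amplify whatever the positive context favors relative to the negative one.
    return logits_pos + alpha * (logits_pos - logits_neg)
```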
- Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics [81.80010043113445]
Local weight fine-tuning, LoRA-based adaptation, and activation-based interventions are typically studied in isolation. We present a unified view that frames these interventions as dynamic weight updates induced by a control signal. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility.
arXiv Detail & Related papers (2026-02-02T17:04:36Z)
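That framing has a very concrete special case: adding a steering vector to a linear layer's output is exactly a (dynamic) bias update, as this toy check shows (the dimensions and the vector `v` are arbitrary):

```python
import torch

layer = torch.nn.Linear(8, 8)
x = torch.randn(3, 8)
v = torch.randn(8)  # an arbitrary steering vector

steered = layer(x) + v   # activation-space intervention
layer.bias.data += v     # the weight update it induces
assert torch.allclose(steered, layer(x), atol=1e-6)
```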
- Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity [45.92643973404507]
We investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts.
arXiv Detail & Related papers (2026-01-10T15:16:23Z)
- Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification [59.59359638389348]
We propose a Dual-level Modality Debiasing Learning (DMDL) framework that implements debiasing at both the model and optimization levels. Experiments on benchmark datasets demonstrate that DMDL enables modality-invariant feature learning and a more generalizable model.
arXiv Detail & Related papers (2025-12-03T12:43:16Z)
- Guiding LLM Decision-Making with Fairness Reward Models [12.32062012708603]
Large language models are increasingly used to support high-stakes decisions. We propose a framework for training a generalizable Fairness Reward Model. We show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
arXiv Detail & Related papers (2025-07-15T14:20:23Z)
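The abstract does not say how the reward model is used at decision time; one plausible pattern is best-of-N reranking, sketched below. The callables `task_score` and `fairness_rm` and the mixing weight `lam` are hypothetical.

```python
def pick_decision(candidates, task_score, fairness_rm, lam=0.5):
    """Choose the candidate that best trades off task quality and fairness.

    task_score and fairness_rm are assumed to map a candidate decision to a
    scalar; lam controls the trade-off. A sketch, not the paper's method.
    """
    return max(candidates,
               key=lambda c: (1 - lam) * task_score(c) + lam * fairness_rm(c))
```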
- Human-assisted Robotic Policy Refinement via Action Preference Optimization [26.144183856600687]
Action Preference Optimization (APO) is a method designed to refine Vision-Language-Action (VLA) models through human-assisted preference alignment. APO employs an adaptive reweighting algorithm with binary desirability signals derived from interaction. Experiments in simulation and real-world scenarios demonstrate superior generalization and robustness.
arXiv Detail & Related papers (2025-06-08T13:14:18Z)
- Active Test-time Vision-Language Navigation [60.69722522420299]
ATENA is a test-time active learning framework that enables practical human-robot interaction via episodic feedback on uncertain navigation outcomes. In particular, ATENA learns to increase certainty in successful episodes and decrease it in failed ones, improving uncertainty calibration. In addition, we propose a self-active learning strategy that enables an agent to evaluate its navigation outcomes based on confident predictions.
arXiv Detail & Related papers (2025-06-07T02:24:44Z)
- Beyond Templates: Dynamic Adaptation of Reasoning Demonstrations via Feasibility-Aware Exploration [15.711365331854614]
We introduce Dynamic Adaptation of Reasoning Trajectories (DART), a novel data adaptation framework. Instead of uniformly imitating expert steps, DART employs a selective imitation strategy guided by step-wise adaptability estimation. We validate DART across multiple reasoning benchmarks and model scales, demonstrating that it significantly improves generalization and data efficiency.
arXiv Detail & Related papers (2025-05-27T04:08:11Z)
- Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-Making [71.71796367760112]
Large language models (LLMs) have shown potential in supporting decision-making applications. We propose a cognitive debiasing approach, self-adaptive cognitive debiasing (SACD). We evaluate SACD on finance, healthcare, and legal decision-making tasks using both open-weight and closed-weight LLMs.
arXiv Detail & Related papers (2025-04-05T11:23:05Z)
- Understanding Endogenous Data Drift in Adaptive Models with Recourse-Seeking Users [6.782864450313782]
We study user strategic behaviors and their interactions with decision-making systems under resource constraints and competitive dynamics. We propose two methods, Fair-top-k and Dynamic Continual Learning, which significantly reduce recourse cost and improve model robustness. Our findings draw connections to economic theories, highlighting how algorithmic decision-making can unintentionally reinforce a higher standard and generate endogenous barriers to entry.
arXiv Detail & Related papers (2025-03-12T12:17:34Z)
- Asymptotically Fair Participation in Machine Learning Models: an Optimal Control Perspective [21.962258178900065]
The performance of state-of-the-art machine learning models often deteriorates when tested on demographics that are under-represented in the training dataset.
We aim to address the problem of achieving asymptotically fair participation via an optimal control formulation.
We apply an efficient implementation of Pontryagin's maximum principle to estimate the optimal control solution.
arXiv Detail & Related papers (2023-11-16T22:28:38Z)
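For reference, the conditions being solved are the standard maximum-principle system below; the specific dynamics $f$ and running cost $L$ encoding fair participation are the paper's and are not reproduced here.

```latex
H(x, u, \lambda, t) = L(x, u, t) + \lambda^\top f(x, u, t), \qquad
\dot{x} = f(x, u, t), \qquad
\dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
u^*(t) = \arg\min_u H(x^*(t), u, \lambda^*(t), t).
```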
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form.
The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight.
Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
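The DPO objective itself is compact enough to state in a few lines; a minimal PyTorch rendering, assuming per-sequence log-probabilities have already been computed for the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are summed log-probabilities of chosen/rejected responses under
    # the policy (pi_*) and the frozen reference model (ref_*).
    pi_logratio = pi_chosen - pi_rejected
    ref_logratio = ref_chosen - ref_rejected
    # Maximize the policy's preference margin relative to the reference.
    return -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
```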
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
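A loose sketch of the balancing idea, not the paper's exact GA rule: rescale the bias-conflicting loss so its gradient magnitude matches the bias-aligned one, so that neither group dominates the update.

```python
import torch

def gradient_aligned_loss(model, loss_aligned, loss_conflicting, eps=1e-12):
    # Loose sketch of gradient alignment: match the two groups' gradient norms.
    params = [p for p in model.parameters() if p.requires_grad]
    g_a = torch.autograd.grad(loss_aligned, params,
                              retain_graph=True, allow_unused=True)
    g_c = torch.autograd.grad(loss_conflicting, params,
                              retain_graph=True, allow_unused=True)
    norm_a = torch.sqrt(sum((g ** 2).sum() for g in g_a if g is not None))
    norm_c = torch.sqrt(sum((g ** 2).sum() for g in g_c if g is not None))
    scale = (norm_a / (norm_c + eps)).detach()  # treat the ratio as a constant
    return loss_aligned + scale * loss_conflicting
```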
- Efficient Empowerment Estimation for Unsupervised Stabilization [75.32013242448151]
The empowerment principle enables unsupervised stabilization of dynamical systems at upright positions.
We propose an alternative solution based on a trainable representation of a dynamical system as a Gaussian channel.
We show that our method has a lower sample complexity, is more stable in training, possesses the essential properties of the empowerment function, and allows estimation of empowerment from images.
arXiv Detail & Related papers (2020-07-14T21:10:16Z)