Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
- URL: http://arxiv.org/abs/2509.03647v1
- Date: Wed, 03 Sep 2025 18:52:55 GMT
- Title: Breaking the Mirror: Activation-Based Mitigation of Self-Preference in LLM Evaluators
- Authors: Dani Roytburg, Matthew Bozoukov, Matthew Nguyen, Jou Barzdukas, Simon Fu, Narmeen Oozeer,
- Abstract summary: "Self-preference bias" undermines fairness and reliability in evaluation pipelines. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly serve as automated evaluators, yet they suffer from "self-preference bias": a tendency to favor their own outputs over those of other models. This bias undermines fairness and reliability in evaluation pipelines, particularly for tasks like preference tuning and model routing. We investigate whether lightweight steering vectors can mitigate this problem at inference time without retraining. We introduce a curated dataset that separates self-preference into justified and unjustified examples, and we construct steering vectors using two methods: Contrastive Activation Addition (CAA) and an optimization-based approach. Our results show that steering vectors can reduce unjustified self-preference bias by up to 97%, substantially outperforming prompting and direct preference optimization baselines. Yet steering vectors are unstable on legitimate self-preference and unbiased agreement, implying that self-preference spans multiple or nonlinear directions. This underscores both their promise and their limits as safeguards for LLM-as-judge pipelines and motivates more robust interventions.
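Contrastive Activation Addition, the first of the two construction methods named in the abstract, can be sketched in a few lines: average the model's hidden activations over contrastive prompt pairs, take the difference, and add a scaled copy of that difference to hidden states at inference. This is a minimal toy illustration with synthetic activations, not the authors' implementation; the array shapes, the scaling parameter `alpha`, and the sign convention are assumptions.

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Contrastive Activation Addition: the steering vector is the mean
    difference between layer activations on contrastive prompt sets."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, vector, alpha):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + alpha * vector

# Toy example: 4 contrastive prompt pairs, hidden size 8.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(4, 8))   # e.g. biased self-preferring judgments
neg = rng.normal(-0.5, 1.0, size=(4, 8))  # e.g. unbiased judgments
v = caa_vector(pos, neg)

h = rng.normal(size=8)                    # a hidden state during evaluation
h_steered = steer(h, v, alpha=-1.0)       # negative alpha pushes away from bias
```

In practice the activations would come from a fixed layer of the judge model while it reads paired prompts, and `alpha` would be tuned on held-out examples.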
Related papers
- One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs [8.089908150148554]
Vision Language Models (VLMs) achieve strong performance on multimodal tasks but still suffer from hallucination and safety-related failures. We propose OSGA (One-shot Steering with Generative Anchor), an input-independent framework that improves model performance with a single optimization instance.
arXiv Detail & Related papers (2026-01-30T14:47:59Z) - InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization [18.988527161000203]
We propose Intrinsic Self-reflective Preference Optimization (InSPO), deriving a globally optimal policy conditioned on both the context and alternative responses. InSPO serves as a plug-and-play enhancement without architectural changes or inference overhead.
arXiv Detail & Related papers (2025-12-29T00:59:23Z) - Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. We present two learning methods tailored to different data conditions. Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
arXiv Detail & Related papers (2025-09-30T20:36:41Z) - Adaptive Margin RLHF via Preference over Preferences [44.328333474444214]
We argue that modeling the strength of preferences can lead to better generalization and more faithful alignment. We introduce an extension to Direct Preference Optimization (DPO), DPO-PoP, that incorporates adaptive margins from preference-over-preference supervision.
arXiv Detail & Related papers (2025-09-26T19:03:24Z) - Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors [13.630818884973127]
We propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
arXiv Detail & Related papers (2025-04-27T12:16:51Z) - Do LLM Evaluators Prefer Themselves for a Reason? [21.730128682888168]
Large language models (LLMs) are increasingly used as automatic evaluators in applications such as benchmarking, reward modeling, and self-refinement. Prior work highlights a potential self-preference bias where LLMs favor their own generated responses. This raises a critical question: Is self-preference detrimental, or does it simply reflect objectively superior outputs from more capable models?
arXiv Detail & Related papers (2025-04-04T18:09:23Z) - Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs [8.91107152198979]
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We compute 8 steering vectors, each corresponding to a different social bias axis, on a training subset of the BBQ dataset and compare their effectiveness to 3 additional bias mitigation methods across 4 datasets. When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet.
arXiv Detail & Related papers (2025-03-07T12:25:29Z) - One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs [21.2431937128876]
We propose optimizing steering vectors through gradient descent on a single training example. We find that the resulting steering vectors (SVs) effectively mediate safety-relevant behaviors in multiple models. We extend work on "emergent misalignment" and show that SVs optimized to induce a model to write vulnerable code cause the model to respond harmfully.
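The one-shot optimization idea above can be illustrated with a toy stand-in: a frozen hidden state and a frozen linear readout replace the model, and gradient descent fits a single steering vector so the steered state produces a target score. The dimensions, learning rate, and the linear readout are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# One-shot steering sketch: optimize a single vector v by gradient descent
# so that a fixed linear "readout" of the steered hidden state hits a target.
rng = np.random.default_rng(0)
h = rng.normal(size=16)   # frozen hidden state from the single training example
w = rng.normal(size=16)   # frozen readout direction (stand-in for the model head)
target = 1.0

v = np.zeros(16)          # the steering vector being optimized
lr = 0.01
for _ in range(500):
    err = w @ (h + v) - target
    v -= lr * 2 * err * w  # gradient of (w @ (h + v) - target)**2 w.r.t. v

final_err = abs(w @ (h + v) - target)
```

In the real setting the "readout" is the model's loss on the behavior of interest, evaluated by a full forward pass, and `v` is added at a chosen layer; the convex toy above only shows the optimization loop's shape.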
arXiv Detail & Related papers (2025-02-26T06:13:01Z) - SPRec: Self-Play to Debias LLM-based Recommendation [23.875509546540904]
Large language models (LLMs) have attracted significant attention in recommendation systems. We propose SPRec, a novel self-play framework designed to mitigate over-recommendation and improve fairness without requiring additional data or manual intervention.
arXiv Detail & Related papers (2024-12-12T12:53:30Z) - Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization [60.176008034221404]
Direct Preference Optimization (DPO) and its variants are increasingly used for aligning language models with human preferences. Prior work has observed that the likelihood of preferred responses often decreases during training. We demonstrate that likelihood displacement can be catastrophic, shifting probability mass from preferred responses to responses with an opposite meaning.
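For context, the standard DPO objective that likelihood displacement concerns fits in a few lines (a generic sketch of the published loss, not code from the paper). The loss constrains only the margin between policy-vs-reference log-ratios, so the preferred response's probability can fall while the loss still decreases, which is the displacement described above.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on one preference pair: negative log-sigmoid of the scaled
    difference of policy-vs-reference log-ratios (w = preferred, l = rejected)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

# Displacement illustration: the preferred response's log-probability drops
# from -1.0 (reference) to -2.0 (policy), yet the loss is below the chance
# value log(2) because the rejected response's log-probability dropped more.
displaced = dpo_loss(logp_w=-2.0, logp_l=-3.0, ref_logp_w=-1.0, ref_logp_l=-1.0)
```

Here `beta` is the usual inverse-temperature hyperparameter; the numbers are made up purely to show that the margin, not the absolute likelihood of the preferred response, drives the loss.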
arXiv Detail & Related papers (2024-10-11T14:22:44Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset. We show that our approach consistently boosts DPO by a considerable margin. Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences. Our key idea is leveraging the human prior knowledge within the small (seed) data. We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Dissecting Human and LLM Preferences [80.55271307662365]
We find that humans are less sensitive to errors, favor responses that support their stances, and show clear dislike when models admit their limits, whereas advanced LLMs like GPT-4-Turbo place more emphasis on correctness, clarity, and harmlessness. We also show that preference-based evaluation can be intentionally manipulated.
arXiv Detail & Related papers (2024-02-17T14:34:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.