Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning
- URL: http://arxiv.org/abs/2602.01528v1
- Date: Mon, 02 Feb 2026 01:43:48 GMT
- Title: Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning
- Authors: Qian Wang, Xuandong Zhao, Zirui Zhang, Zhanzhi Lou, Nuo Chen, Dawn Song, Bingsheng He
- Abstract summary: Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases. We propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers.
- Score: 91.8584139564909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) increasingly serve as automated judges, yet they remain susceptible to cognitive biases -- often altering their reasoning when faced with spurious prompt-level cues such as consensus claims or authority appeals. Existing mitigations via prompting or supervised fine-tuning fail to generalize, as they modify surface behavior without changing the optimization objective that makes bias cues predictive. To address this gap, we propose Epistemic Independence Training (EIT), a reinforcement learning framework grounded in a key principle: to learn independence, bias cues must be made non-predictive of reward. EIT operationalizes this through a balanced conflict strategy where bias signals are equally likely to support correct and incorrect answers, combined with a reward design that penalizes bias-following without rewarding bias agreement. Experiments on Qwen3-4B demonstrate that EIT improves both accuracy and robustness under adversarial biases, while preserving performance when bias aligns with truth. Notably, models trained only on bandwagon bias generalize to unseen bias types such as authority and distraction, indicating that EIT induces transferable epistemic independence rather than bias-specific heuristics. Code and data are available at https://anonymous.4open.science/r/bias-mitigation-with-rl-BC47.
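The balanced conflict strategy and reward design described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the names `BiasedExample`, `inject_bandwagon`, and `eit_reward`, the wording of the consensus cue, and the specific reward values (+1, 0, -1) are all assumptions made for this example. The abstract specifies only that cues support correct and incorrect answers with equal probability, and that bias-following is penalized while agreement with the cue earns no extra reward.

```python
import random
from dataclasses import dataclass


@dataclass
class BiasedExample:
    """A judged question with an injected bias cue (hypothetical structure)."""
    prompt: str
    correct: str
    bias_target: str  # the option the injected cue points to


def inject_bandwagon(question, options, correct, rng):
    """Attach a consensus cue that backs the correct answer with probability 0.5.

    Because the cue is right half the time and wrong half the time,
    following it carries no information about which answer earns reward.
    """
    wrong = [o for o in options if o != correct]
    target = correct if rng.random() < 0.5 else rng.choice(wrong)
    cue = f"Most people believe the answer is {target}."  # assumed cue wording
    prompt = f"{cue}\n{question}\nOptions: {', '.join(options)}"
    return BiasedExample(prompt=prompt, correct=correct, bias_target=target)


def eit_reward(answer, example, penalty=1.0):
    """Assumed reward shape: correctness only, plus a penalty for bias-following.

    - Correct answer: +1, with no bonus when the cue happened to agree.
    - Wrong answer that matches the cue: -penalty (bias-following).
    - Any other wrong answer: 0.
    """
    if answer == example.correct:
        return 1.0
    if answer == example.bias_target:
        return -penalty
    return 0.0
```

Under these assumptions, a policy that always copies the cue earns an expected reward of about zero on the balanced set (+1 half the time, -1 the other half), while a policy that ignores the cue and answers correctly earns +1 every time, so the cue gives the optimizer nothing to exploit.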
Related papers
- ALBAR: Adversarial Learning approach to mitigate Biases in Action Recognition [52.537021302246664]
Action recognition models often suffer from background bias (i.e., inferring actions based on background cues) and foreground bias (i.e., relying on subject appearance). We propose ALBAR, a novel adversarial training method that mitigates foreground and background biases without requiring specialized knowledge of the bias attributes. We evaluate our method on established background and foreground bias protocols, setting a new state-of-the-art and strongly improving combined debiasing performance by over 12% absolute on HMDB51.
arXiv Detail & Related papers (2025-01-31T20:47:06Z) - Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z) - Take Care of Your Prompt Bias! Investigating and Mitigating Prompt Bias in Factual Knowledge Extraction [56.17020601803071]
Recent research shows that pre-trained language models (PLMs) suffer from "prompt bias" in factual knowledge extraction.
This paper aims to improve the reliability of existing benchmarks by thoroughly investigating and mitigating prompt bias.
arXiv Detail & Related papers (2024-03-15T02:04:35Z) - Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought [33.32335629744919]
Chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning. CoT can also systematically misrepresent the factors influencing models' behavior. We first create a new dataset of 9 different biases that affect GPT-3.5-Turbo and Llama-8b models.
arXiv Detail & Related papers (2024-03-08T18:41:42Z) - Signal Is Harder To Learn Than Bias: Debiasing with Focal Loss [10.031357641396616]
Neural networks are notorious for learning unwanted associations, also known as biases, instead of the underlying decision rule.
We propose Signal is Harder, a variational-autoencoder-based method that simultaneously trains a biased and unbiased classifier.
We propose a perturbation scheme in the latent space for visualizing the bias that helps practitioners become aware of the sources of spurious correlations.
arXiv Detail & Related papers (2023-05-31T09:09:59Z) - Self-supervised debiasing using low rank regularization [59.84695042540525]
Spurious correlations can cause strong biases in deep neural networks, impairing generalization ability.
We propose a self-supervised debiasing framework potentially compatible with unlabeled samples.
Remarkably, the proposed debiasing framework significantly improves the generalization performance of self-supervised learning baselines.
arXiv Detail & Related papers (2022-10-11T08:26:19Z) - Unsupervised Learning of Unbiased Visual Representations [12.690228982893]
Deep neural networks often struggle to learn robust representations in the presence of dataset biases. Existing approaches to address this problem typically involve explicit supervision of bias attributes or reliance on prior knowledge about the biases. We present a fully unsupervised debiasing framework with three key steps.
arXiv Detail & Related papers (2022-04-26T10:51:50Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Learning Debiased Models with Dynamic Gradient Alignment and Bias-conflicting Sample Mining [39.00256193731365]
Deep neural networks notoriously suffer from dataset biases which are detrimental to model robustness, generalization and fairness.
We propose a two-stage debiasing scheme to combat intractable unknown biases.
arXiv Detail & Related papers (2021-11-25T14:50:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.