Fairness Aware Reward Optimization
- URL: http://arxiv.org/abs/2602.07799v1
- Date: Sun, 08 Feb 2026 03:35:49 GMT
- Title: Fairness Aware Reward Optimization
- Authors: Ching Lam Choi, Vighnesh Subramaniam, Phillip Isola, Antonio Torralba, Stefanie Jegelka
- Abstract summary: We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment. Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
- Score: 78.85867531002346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving that fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
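The in-processing idea described in the abstract can be sketched as a Lagrangian-style objective: a standard Bradley-Terry ranking loss on preference pairs plus a demographic-parity penalty on the reward scores. The function names, the mean-reward-gap penalty, and the weight `lam` below are illustrative assumptions, not Faro's actual implementation:

```python
import math

def bradley_terry_loss(chosen, rejected):
    """Mean negative log-likelihood of preferring chosen over rejected
    under the Bradley-Terry model (the usual ordinal reward-model loss)."""
    return -sum(math.log(1.0 / (1.0 + math.exp(-(c - r))))
                for c, r in zip(chosen, rejected)) / len(chosen)

def demographic_parity_gap(rewards, groups):
    """Absolute gap between the mean rewards assigned to the two
    demographic groups (0 and 1): the quantity a DP constraint bounds."""
    g0 = [r for r, g in zip(rewards, groups) if g == 0]
    g1 = [r for r, g in zip(rewards, groups) if g == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

def constrained_objective(chosen, rejected, groups, lam=1.0):
    """Ranking loss plus a fairness penalty, Lagrangian style.
    `groups[i]` is the demographic group of preference pair i."""
    rewards = list(chosen) + list(rejected)
    g = list(groups) + list(groups)
    return bradley_terry_loss(chosen, rejected) + lam * demographic_parity_gap(rewards, g)
```

With equal reward margins across groups the penalty vanishes and only the ranking loss remains; when one group's responses are scored systematically higher, the gap term dominates and pushes training back toward parity.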
Related papers
- Fairness-informed Pareto Optimization : An Efficient Bilevel Framework [9.47506642944168]
We present BADR, a framework to recover the optimal model for any fairness metric. We equip BADR with two novel large-scale, single-loop algorithms, BADR-GD and BADR-SGD. BADR is an open-source Python toolbox implementing our framework for a variety of learning tasks and fairness metrics.
arXiv Detail & Related papers (2026-01-19T23:05:07Z) - GDRO: Group-level Reward Post-training Suitable for Diffusion Models [55.948229011478304]
Group-level rewards successfully align the model with the targeted reward. Group-level Direct Reward Optimization (GDRO) is a new post-training paradigm for group-level reward alignment. GDRO supports fully offline training, which saves the large time cost of image rollout sampling. It is diffusion-sampler-independent, eliminating the need for the ODE-to-SDE approximation.
arXiv Detail & Related papers (2026-01-05T11:47:18Z) - Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking [78.69179041551014]
We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle. We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization. We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
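Information Bottleneck objectives of this kind are typically implemented with a variational bound: the encoder's Gaussian posterior over the latent representation is pulled toward a standard-normal prior via a closed-form KL term. The sketch below shows only that compression penalty; the function names, the weight `beta`, and the combination with a task loss are illustrative assumptions, not InfoRM's actual code:

```python
import math

def gaussian_kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent
    dimensions: the standard variational-IB compression penalty."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def ib_reward_objective(task_loss, mu, logvar, beta=1e-3):
    """Preference (task) loss plus beta-weighted compression penalty;
    larger beta discards more preference-irrelevant information."""
    return task_loss + beta * gaussian_kl_to_standard_normal(mu, logvar)
```

The penalty is zero exactly when the posterior matches the prior (mu = 0, logvar = 0) and grows as the latent code carries more information about the input.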
arXiv Detail & Related papers (2025-10-15T15:51:59Z) - Guiding LLM Decision-Making with Fairness Reward Models [12.32062012708603]
Large language models are increasingly used to support high-stakes decisions. We propose a framework for training a generalizable Fairness Reward Model. We show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
arXiv Detail & Related papers (2025-07-15T14:20:23Z) - BiFair: A Fairness-aware Training Framework for LLM-enhanced Recommender Systems via Bi-level Optimization [13.187285894531275]
BiFair is a fairness-aware training framework designed to mitigate both prior and training unfairness simultaneously. Extensive experiments on three real-world datasets demonstrate that BiFair significantly mitigates unfairness and outperforms previous state-of-the-art methods.
arXiv Detail & Related papers (2025-07-06T08:39:26Z) - FedFACT: A Provable Framework for Controllable Group-Fairness Calibration in Federated Learning [23.38141950440522]
We propose a controllable federated group-fairness calibration framework, named FedFACT. FedFACT identifies the Bayes-optimal classifiers under both global and local fairness constraints. We show that FedFACT consistently outperforms baselines in balancing accuracy and global-local fairness.
arXiv Detail & Related papers (2025-06-04T09:39:57Z) - The Other Side of the Coin: Exploring Fairness in Retrieval-Augmented Generation [73.16564415490113]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant documents from external knowledge sources. We propose two approaches, FairFT and FairFilter, to mitigate the fairness issues introduced by RAG for small-scale LLMs.
arXiv Detail & Related papers (2025-04-11T10:17:10Z) - FairLoRA: Unpacking Bias Mitigation in Vision Models with Fairness-Driven Low-Rank Adaptation [3.959853359438669]
We introduce FairLoRA, a novel fairness-specific regularizer for Low-Rank Adaptation (LoRA).
Our results demonstrate that the need for higher ranks to mitigate bias is not universal; it depends on factors such as the pre-trained model, dataset, and task.
arXiv Detail & Related papers (2024-10-22T18:50:36Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
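The combined objective this summary describes can be sketched as a pairwise preference loss plus a weighted SFT term (negative log-likelihood of the chosen responses). The margin form, the function names, and the mixing weight `eta` are illustrative assumptions, not the paper's exact formulation:

```python
import math

def preference_loss(logp_chosen, logp_rejected, beta=1.0):
    """Pairwise loss: -log sigmoid of the beta-scaled log-prob margin
    between chosen and rejected responses."""
    return -sum(math.log(1.0 / (1.0 + math.exp(-beta * (c - r))))
                for c, r in zip(logp_chosen, logp_rejected)) / len(logp_chosen)

def sft_loss(logp_chosen):
    """Supervised term: mean negative log-likelihood of chosen responses."""
    return -sum(logp_chosen) / len(logp_chosen)

def regularized_objective(logp_chosen, logp_rejected, beta=1.0, eta=0.5):
    """Preference optimization loss with the SFT loss acting as an
    (implicit adversarial) regularizer against overoptimization."""
    return (preference_loss(logp_chosen, logp_rejected, beta)
            + eta * sft_loss(logp_chosen))
```

The SFT term anchors the policy to high-likelihood chosen responses, so the preference loss cannot be driven down by drifting onto out-of-distribution outputs.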
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Chasing Fairness Under Distribution Shift: A Model Weight Perturbation Approach [72.19525160912943]
We first theoretically demonstrate the inherent connection between distribution shift, data perturbation, and model weight perturbation.
We then analyze the sufficient conditions to guarantee fairness for the target dataset.
Motivated by these sufficient conditions, we propose robust fairness regularization (RFR).
arXiv Detail & Related papers (2023-03-06T17:19:23Z) - Fairness Reprogramming [42.65700878967251]
We propose a new generic fairness learning paradigm, called FairReprogram, which incorporates the model reprogramming technique.
Specifically, FairReprogram considers the case where models cannot be changed and appends to the input a set of perturbations, called the fairness trigger.
We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output prediction of fixed ML models.
arXiv Detail & Related papers (2022-09-21T09:37:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.