Towards Reward Fairness in RLHF: From a Resource Allocation Perspective
- URL: http://arxiv.org/abs/2505.23349v1
- Date: Thu, 29 May 2025 11:12:00 GMT
- Title: Towards Reward Fairness in RLHF: From a Resource Allocation Perspective
- Authors: Sheng Ouyang, Yulan Hu, Ge Chen, Qingyang Li, Fuzheng Zhang, Yong Liu
- Abstract summary: This paper collectively defines the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective.
- Score: 16.82198859401237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fairness reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences in a more fair manner.
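The abstract does not spell out the exact form of Fairness Regularization or the Fairness Coefficient, so the sketch below only illustrates the general recipe it describes: a standard Bradley-Terry reward-model objective (the utility term) combined with a penalty on how unevenly reward is allocated, traded off by a coefficient. The function name `fairness_regularized_loss`, the variance-style penalty, and the weight `fairness_coef` are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch, not the paper's method: Bradley-Terry utility term plus an
# assumed variance-style fairness penalty, traded off by `fairness_coef`.
import torch
import torch.nn.functional as F

def fairness_regularized_loss(r_chosen: torch.Tensor,
                              r_rejected: torch.Tensor,
                              fairness_coef: float = 0.1) -> torch.Tensor:
    """r_chosen / r_rejected: reward-model scores for preferred / dispreferred responses."""
    # Utility term: standard Bradley-Terry preference loss.
    utility_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Assumed fairness term: treat reward margins as the "resource" being
    # allocated and penalize a highly unequal allocation via their variance.
    margins = r_chosen - r_rejected
    fairness_penalty = margins.var(unbiased=False)
    return utility_loss + fairness_coef * fairness_penalty

# Toy usage with random scores standing in for reward-model outputs.
r_c = torch.randn(8, requires_grad=True)
r_r = torch.randn(8)
loss = fairness_regularized_loss(r_c, r_r)
loss.backward()
print(loss.item())
```

In the reinforcement learning scenario the abstract mentions, an analogous penalty could be applied to the rewards handed to the policy optimizer rather than to the reward-model loss; the paper should be consulted for the actual construction.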
Related papers
- Guiding LLM Decision-Making with Fairness Reward Models [12.32062012708603]
Large language models are increasingly used to support high-stakes decisions. We propose a framework for training a generalizable Fairness Reward Model. We show that our approach consistently improves fairness while matching, or even surpassing, baseline accuracy.
arXiv Detail & Related papers (2025-07-15T14:20:23Z)
- Bias Fitting to Mitigate Length Bias of Reward Model in RLHF [81.44256822500257]
Reinforcement Learning from Human Feedback relies on reward models to align large language models with human preferences. We propose FiMi-RM, a framework that autonomously learns and corrects underlying bias patterns. Experimental results demonstrate that FiMi-RM achieves a more balanced length-reward distribution.
arXiv Detail & Related papers (2025-05-19T08:29:28Z)
- Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners [15.25763345316458]
Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from. We introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function.
arXiv Detail & Related papers (2025-03-08T00:38:17Z)
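As a rough illustration of the Trajectory Alignment Coefficient described in the entry above, the sketch below computes a rank correlation between a human stakeholder's ranking of trajectories and the ranking a candidate reward function induces. The paper's exact metric may be defined differently; Kendall's tau and the helper name `trajectory_alignment_sketch` are stand-ins chosen here for illustration.

```python
# Rough stand-in for the Trajectory Alignment Coefficient: rank agreement
# between a human ranking of trajectories and the ranking a reward function
# induces. The paper's exact definition may differ.
from scipy.stats import kendalltau

def trajectory_alignment_sketch(human_ranks, reward_totals):
    """human_ranks: human-assigned rank per trajectory (1 = most preferred).
    reward_totals: total reward the candidate reward function assigns to each trajectory."""
    # Higher reward should mean a better (numerically lower) human rank, so
    # negate rewards before correlating the two orderings.
    tau, _ = kendalltau(human_ranks, [-r for r in reward_totals])
    return tau

# The human prefers trajectory 0 over 2 over 1, and the reward function agrees,
# so the coefficient is 1.0 (perfect alignment).
print(trajectory_alignment_sketch([1, 3, 2], [10.0, 2.0, 7.5]))
```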
- Fairness Aware Reinforcement Learning via Proximal Policy Optimization [7.061167083587786]
This paper introduces fairness into Proximal Policy Optimization (PPO) with a penalty term derived from demographic parity, counterfactual fairness, and conditional statistical parity. We evaluate our approach in the Allelopathic Harvest game, a cooperative and competitive multi-agent system focused on resource collection.
arXiv Detail & Related papers (2025-02-06T10:45:55Z)
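The entry above describes adding fairness penalties to PPO. As a hedged illustration of just the demographic-parity piece, the sketch below shapes an environment reward with the gap in favorable-outcome rates between groups; the grouping variable, the outcome definition, the helper names, and `penalty_weight` are assumptions, not the paper's construction.

```python
# Hedged sketch of a demographic-parity penalty folded into a PPO-style reward.
# The grouping, outcome definition, and `penalty_weight` are illustrative only.
from collections import defaultdict

def demographic_parity_gap(outcomes, groups):
    """outcomes: 1/0 indicators of a favorable outcome; groups: group label per outcome."""
    per_group = defaultdict(list)
    for outcome, group in zip(outcomes, groups):
        per_group[group].append(outcome)
    rates = [sum(v) / len(v) for v in per_group.values()]
    return max(rates) - min(rates)  # 0.0 means the groups are treated alike

def fairness_shaped_reward(env_reward, outcomes, groups, penalty_weight=1.0):
    # Subtract the parity gap so the policy gradient also pushes toward
    # allocations whose favorable-outcome rates match across groups.
    return env_reward - penalty_weight * demographic_parity_gap(outcomes, groups)

# Group "a" always gets the favorable outcome and group "b" never does,
# so the shaped reward drops from 1.0 to 0.0.
print(fairness_shaped_reward(1.0, [1, 1, 0, 0], ["a", "a", "b", "b"]))
```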
- Towards Harmless Rawlsian Fairness Regardless of Demographic Prior [57.30787578956235]
We explore the potential for achieving fairness without compromising utility when no demographic prior is provided for the training set.
We propose a simple but effective method named VFair to minimize the variance of training losses inside the optimal set of empirical losses.
arXiv Detail & Related papers (2024-11-04T12:40:34Z)
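VFair's core ingredient, as summarized above, is minimizing the variance of per-example training losses. The minimal sketch below shows only that variance penalty; the restriction to the optimal set of empirical losses (what keeps the method "harmless" to utility) is omitted, and the weight `var_weight` is an illustrative assumption.

```python
# Minimal sketch of the variance-of-losses idea; VFair's restriction to the
# optimal set of empirical losses is omitted, and `var_weight` is assumed.
import torch

def variance_penalized_loss(per_example_losses: torch.Tensor,
                            var_weight: float = 1.0) -> torch.Tensor:
    mean_loss = per_example_losses.mean()
    # Low variance across examples serves as a demographics-free proxy for
    # treating all training points (and hence hidden groups) more evenly.
    loss_variance = per_example_losses.var(unbiased=False)
    return mean_loss + var_weight * loss_variance

losses = torch.tensor([0.2, 0.9, 0.4, 0.8])
print(variance_penalized_loss(losses).item())
```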
- Evaluating the Fairness of Discriminative Foundation Models in Computer Vision [51.176061115977774]
We propose a novel taxonomy for bias evaluation of discriminative foundation models, such as Contrastive Language-Image Pretraining (CLIP).
We then systematically evaluate existing methods for mitigating bias in these models with respect to our taxonomy.
Specifically, we evaluate OpenAI's CLIP and OpenCLIP models for key applications, such as zero-shot classification, image retrieval and image captioning.
arXiv Detail & Related papers (2023-10-18T10:32:39Z)
- Provable Benefits of Policy Learning from Human Preferences in Contextual Bandit Problems [82.92678837778358]
Preference-based methods have demonstrated substantial success in empirical applications such as InstructGPT.
We show how human bias and uncertainty in feedback modeling can affect the theoretical guarantees of these approaches.
arXiv Detail & Related papers (2023-07-24T17:50:24Z)
- How Robust is Your Fairness? Evaluating and Sustaining Fairness under Unseen Distribution Shifts [107.72786199113183]
We propose a novel fairness learning method termed CUrvature MAtching (CUMA).
CUMA achieves robust fairness generalizable to unseen domains with unknown distributional shifts.
We evaluate our method on three popular fairness datasets.
arXiv Detail & Related papers (2022-07-04T02:37:50Z)
- Towards Equal Opportunity Fairness through Adversarial Learning [64.45845091719002]
Adversarial training is a common approach for bias mitigation in natural language processing.
We propose an augmented discriminator for adversarial training, which takes the target class as input to create richer features.
arXiv Detail & Related papers (2022-03-12T02:22:58Z)
- Learning Fair Policies in Multiobjective (Deep) Reinforcement Learning with Average and Discounted Rewards [15.082715993594121]
We investigate the problem of learning a policy that treats its users equitably.
In this paper, we formulate this novel RL problem, in which an objective function encoding a notion of fairness is optimized.
We describe how several classic deep RL algorithms can be adapted to our fair optimization problem.
arXiv Detail & Related papers (2020-08-18T07:17:53Z)
- Recovering from Biased Data: Can Fairness Constraints Improve Accuracy? [11.435833538081557]
Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution.
We examine the ability of fairness-constrained ERM to correct this problem.
We also consider other recovery methods including reweighting the training data, Equalized Odds, and Demographic Parity.
arXiv Detail & Related papers (2019-12-02T22:00:14Z)
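Of the recovery methods listed in the entry above, reweighting the training data is the simplest to illustrate. The sketch below uses a common group-aware scheme (weight = expected frequency of a group-label pair under independence, divided by its observed frequency); it is one standard choice, not necessarily the exact scheme examined in the paper.

```python
# One common reweighting scheme for recovering from a biased sample:
# weight = expected (group, label) frequency under independence / observed
# frequency, so under-represented combinations are up-weighted. Not
# necessarily the exact scheme examined in the paper.
from collections import Counter

def group_label_weights(groups, labels):
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [group_counts[g] * label_counts[y] / (n * joint_counts[(g, y)])
            for g, y in zip(groups, labels)]

# Group "b" rarely receives the favorable label 1, so that example is
# up-weighted (weight 2.0) while the over-represented pairs are down-weighted.
print(group_label_weights(["a", "a", "a", "b", "b", "b"], [1, 1, 1, 1, 0, 0]))
```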
This list is automatically generated from the titles and abstracts of the papers on this site.