Related papers: Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models

URL: http://arxiv.org/abs/2503.07806v1
Date: Mon, 10 Mar 2025 19:39:39 GMT
Title: Towards Large Language Models that Benefit for All: Benchmarking Group Fairness in Reward Models
Authors: Kefan Song, Jin Yao, Runnan Jiang, Rohan Chandra, Shangtong Zhang
Abstract summary: Large Language Models (LLMs) are increasingly powerful and accessible to human users. Ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. This work benchmarks the group fairness of learned reward models.
Score: 16.977176752570617
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Large Language Models (LLMs) become increasingly powerful and accessible to human users, ensuring fairness across diverse demographic groups, i.e., group fairness, is a critical ethical concern. However, current fairness and bias research in LLMs is limited in two aspects. First, compared to traditional group fairness in machine learning classification, it requires that the non-sensitive attributes, in this case, the prompt questions, be the same across different groups. In many practical scenarios, different groups, however, may prefer different prompt questions and this requirement becomes impractical. Second, it evaluates group fairness only for the LLM's final output without identifying the source of possible bias. Namely, the bias in LLM's output can result from both the pretraining and the finetuning. For finetuning, the bias can result from both the RLHF procedure and the learned reward model. Arguably, evaluating the group fairness of each component in the LLM pipeline could help develop better methods to mitigate the possible bias. Recognizing those two limitations, this work benchmarks the group fairness of learned reward models. By using expert-written text from arXiv, we are able to benchmark the group fairness of reward models without requiring the same prompt questions across different demographic groups. Surprisingly, our results demonstrate that all the evaluated reward models (e.g., Nemotron-4-340B-Reward, ArmoRM-Llama3-8B-v0.1, and GRM-llama3-8B-sftreg) exhibit statistically significant group unfairness. We also observed that top-performing reward models (w.r.t. canonical performance metrics) tend to demonstrate better group fairness.
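
The abstract reports "statistically significant group unfairness" but does not say which statistical test was used. As an illustration only, the sketch below uses a permutation test (a swapped-in technique, not necessarily the authors' method) to check whether the gap between per-group mean reward scores is larger than chance would explain. The function names, group labels, and scores are all hypothetical.

```python
import random
from statistics import mean


def max_mean_gap(scores_by_group):
    """Largest difference between any two groups' mean reward score."""
    means = [mean(scores) for scores in scores_by_group.values()]
    return max(means) - min(means)


def permutation_p_value(scores_by_group, n_permutations=2000, seed=0):
    """Permutation test on the max gap in mean reward across groups.

    Assumption: under the null hypothesis, group labels are exchangeable,
    so shuffling the pooled scores across groups simulates "no unfairness".
    Returns (observed_gap, p_value).
    """
    rng = random.Random(seed)
    observed = max_mean_gap(scores_by_group)
    names = list(scores_by_group)
    sizes = [len(scores_by_group[name]) for name in names]
    pooled = [s for name in names for s in scores_by_group[name]]

    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled, start = {}, 0
        # Re-split the shuffled pool into groups of the original sizes.
        for name, size in zip(names, sizes):
            shuffled[name] = pooled[start:start + size]
            start += size
        if max_mean_gap(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical reward scores for text written by three author groups.
    demo = {
        "group_a": [0.71, 0.64, 0.69, 0.75, 0.62],
        "group_b": [0.58, 0.61, 0.55, 0.60, 0.57],
        "group_c": [0.66, 0.63, 0.70, 0.59, 0.65],
    }
    gap, p = permutation_p_value(demo)
    print(f"max mean gap = {gap:.3f}, p = {p:.4f}")
```

A small p-value would suggest that the reward model scores the groups differently. Because each group's texts can come from different prompts, a test of this kind does not need the same prompt questions across groups, which is the constraint the abstract sets out to relax.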

Related papers

On Optimal Steering to Achieve Exact Fairness [29.589891801235083]
Empirically, our optimal steering techniques on both synthetic and real-world datasets improve fairness without diminishing utility. We demonstrate affine steering of LLM representations to reduce bias in multi-class classification.
arXiv Detail & Related papers (2025-09-19T08:37:51Z)
SPARTA ALIGNMENT: Collectively Aligning Multiple Language Models through Combat [73.529925653031]
We propose SPARTA ALIGNMENT, an algorithm to collectively align multiple LLMs through competition and combat. For each iteration, one instruction and two models are selected for a duel, the other models evaluate the two responses, and their evaluation scores are aggregated through an adapted Elo-ranking-based reputation system. The peer-evaluated combat results then become preference pairs where the winning response is preferred over the losing one, and all models learn from these preferences at the end of each iteration.
arXiv Detail & Related papers (2025-06-05T07:51:23Z)
Quantitative LLM Judges [60.773734899532336]
We propose quantitative LLM judges, which align evaluation scores of existing LLM judges to humans in a given domain. The models are trained to improve the score of the original judge using its rationale and score. Our experiments show that quantitative judges can improve the predictive power of existing judges through post-hoc modeling.
arXiv Detail & Related papers (2025-06-03T14:44:23Z)
FairLoRA: Unpacking Bias Mitigation in Vision Models with Fairness-Driven Low-Rank Adaptation [3.959853359438669]
We introduce FairLoRA, a novel fairness-specific regularizer for Low Rank Adaptation (LoRA). Our results demonstrate that the need for higher ranks to mitigate bias is not universal; it depends on factors such as the pre-trained model, dataset, and task.
arXiv Detail & Related papers (2024-10-22T18:50:36Z)
Identifying and Mitigating Social Bias Knowledge in Language Models [52.52955281662332]
We propose a novel debiasing approach, Fairness Stamp (FAST), which enables fine-grained calibration of individual social biases. FAST surpasses state-of-the-art baselines with superior debiasing performance. This highlights the potential of fine-grained debiasing strategies to achieve fairness in large language models.
arXiv Detail & Related papers (2024-08-07T17:14:58Z)
Inducing Group Fairness in Prompt-Based Language Model Decisions [12.964746511263833]
Novel prompt-based language model (LM) decision making has created new opportunities to solve classification tasks. The 'remediation toolkit' is incomplete for LM-based decision makers, and little is understood about how to improve decision maker group fairness.
arXiv Detail & Related papers (2024-06-24T15:45:20Z)
Few-Shot Fairness: Unveiling LLM's Potential for Fairness-Aware Classification [7.696798306913988]
We introduce a framework outlining fairness regulations aligned with various fairness definitions. We explore the configuration for in-context learning and the procedure for selecting in-context demonstrations using RAG. Experiments conducted with different LLMs indicate that GPT-4 delivers superior results in terms of both accuracy and fairness compared to other models.
arXiv Detail & Related papers (2024-02-28T17:29:27Z)
Fair Abstractive Summarization of Diverse Perspectives [103.08300574459783]
A fair summary should provide a comprehensive coverage of diverse perspectives without underrepresenting certain groups. We first formally define fairness in abstractive summarization as not underrepresenting perspectives of any groups of people. We propose four reference-free automatic metrics by measuring the differences between target and source perspectives.
arXiv Detail & Related papers (2023-11-14T03:38:55Z)
Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs). We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing. We then unify the literature by proposing three intuitive taxonomies, two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
UP5: Unbiased Foundation Model for Fairness-aware Recommendation [45.47673627667594]
There is a growing concern that Large Language Models might inadvertently perpetuate societal stereotypes, resulting in unfair recommendations. This paper focuses on user-side fairness for LLM-based recommendation where the users may require a recommender system to be fair on sensitive features such as gender or age. We introduce a novel Counterfactually-Fair-Prompt (CFP) method towards Unbiased Foundation mOdels (UFO) for fairness-aware LLM-based recommendation.
arXiv Detail & Related papers (2023-05-20T04:32:59Z)
DualFair: Fair Representation Learning at Both Group and Individual Levels via Contrastive Self-supervision [73.80009454050858]
This work presents a self-supervised model, called DualFair, that can debias sensitive attributes like gender and race from learned representations. Our model jointly optimizes for two fairness criteria: group fairness and counterfactual fairness.
arXiv Detail & Related papers (2023-03-15T07:13:54Z)
On Comparing Fair Classifiers under Data Bias [42.43344286660331]
We study the effect of varying data biases on the accuracy and fairness of fair classifiers. Our experiments show how to integrate a measure of data bias risk in the existing fairness dashboards for real-world deployments.
arXiv Detail & Related papers (2023-02-12T13:04:46Z)
Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective [50.093280002375984]
Performance unfairness among variables widely exists in multivariate time series (MTS) forecasting models. We propose a novel framework, named FairFor, for fairness-aware MTS forecasting.
arXiv Detail & Related papers (2023-01-27T04:54:12Z)
How Robust is Your Fairness? Evaluating and Sustaining Fairness under Unseen Distribution Shifts [107.72786199113183]
We propose a novel fairness learning method termed CUrvature MAtching (CUMA). CUMA achieves robust fairness generalizable to unseen domains with unknown distributional shifts. We evaluate our method on three popular fairness datasets.
arXiv Detail & Related papers (2022-07-04T02:37:50Z)
Fair Group-Shared Representations with Normalizing Flows [68.29997072804537]
We develop a fair representation learning algorithm which is able to map individuals belonging to different groups into a single group. We show experimentally that our methodology is competitive with other fair representation learning algorithms.
arXiv Detail & Related papers (2022-01-17T10:49:49Z)
Recovering from Biased Data: Can Fairness Constraints Improve Accuracy? [11.435833538081557]
Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. We also consider other recovery methods including reweighting the training data, Equalized Odds, and Demographic Parity.
arXiv Detail & Related papers (2019-12-02T22:00:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.