A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
- URL: http://arxiv.org/abs/2512.08786v2
- Date: Mon, 15 Dec 2025 19:37:04 GMT
- Title: A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
- Authors: Mahmoud Srewa, Tianyu Zhao, Salma Elmalaki,
- Abstract summary: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning environments.<n>We introduce a comprehensive evaluation framework that assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences.
- Score: 2.840505903487544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the challenge of aligning large language models (LLMs) with diverse human preferences within federated learning (FL) environments, where standard methods often fail to adequately represent diverse viewpoints. We introduce a comprehensive evaluation framework that systematically assesses the trade-off between alignment quality and fairness when using different aggregation strategies for human preferences. In our federated setting, each group locally evaluates rollouts and produces reward signals, and the server aggregates these group-level rewards without accessing any raw data. Specifically, we evaluate standard reward aggregation techniques (min, max, and average) and introduce a novel adaptive scheme that dynamically adjusts preference weights based on a group's historical alignment performance. Our experiments on question-answering (Q/A) tasks using a PPO-based RLHF pipeline demonstrate that our adaptive approach consistently achieves superior fairness while maintaining competitive alignment scores. This work offers a robust methodology for evaluating LLM behavior across diverse populations and provides a practical solution for developing truly pluralistic and fairly aligned models.
Related papers
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - GroupRank: A Groupwise Reranking Paradigm Driven by Reinforcement Learning [26.616849067985967]
Groupwise is a novel reranking paradigm for large language models.<n>We propose an innovative pipeline for high quality retrieval and ranking data.<n>The resulting data can be leveraged not only for training the reranker but also for training the retriever.
arXiv Detail & Related papers (2025-11-10T15:25:31Z) - DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data [65.09939942413651]
We propose a principled extension to GRPO that addresses inter-group imbalance with two key innovations.<n> Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence.<n>Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value.
arXiv Detail & Related papers (2025-05-21T03:43:29Z) - In-context Ranking Preference Optimization [65.5489745857577]
We propose an In-context Ranking Preference Optimization (IRPO) framework to optimize large language models (LLMs) based on ranking lists constructed during inference.<n>We show IRPO outperforms standard DPO approaches in ranking performance, highlighting its effectiveness in aligning LLMs with direct in-context ranking preferences.
arXiv Detail & Related papers (2025-04-21T23:06:12Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications.<n> Ensuring their alignment with the diverse preferences of individual users has become a critical challenge.<n>We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - Pareto-Optimal Learning from Preferences with Hidden Context [17.590330740964266]
We propose POPL, which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs.<n>Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies.<n>We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness.
arXiv Detail & Related papers (2024-06-21T18:57:38Z) - Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.<n>The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.<n>We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Generalizing Reward Modeling for Out-of-Distribution Preference Learning [3.9160947065896803]
Preference learning with large language models (LLMs) aims to align the LLMs' generations with human preferences.
Due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging.
This work addresses OOD PL by optimizing a general reward model through a meta-learning approach.
arXiv Detail & Related papers (2024-02-22T18:20:33Z) - Re-weighting Based Group Fairness Regularization via Classwise Robust
Optimization [30.089819400033985]
We propose a principled method, dubbed as ours, which unifies the two learning schemes by incorporating a well-justified group fairness metric into the training objective.
We develop an iterative optimization algorithm that minimizes the resulting objective by automatically producing the correct re-weights for each group.
Our experiments show that FairDRO is scalable and easily adaptable to diverse applications.
arXiv Detail & Related papers (2023-03-01T12:00:37Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.