Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- URL: http://arxiv.org/abs/2511.12596v1
- Date: Sun, 16 Nov 2025 13:42:55 GMT
- Title: Group-Aware Reinforcement Learning for Output Diversity in Large Language Models
- Authors: Oron Anschel, Alon Shoshan, Adam Botach, Shunit Haviv Hakimi, Asaf Gendler, Emanuel Ben Baruch, Nadav Bhonker, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni,
- Abstract summary: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist.<n>We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole.<n>We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses.
- Score: 8.356950556877612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) often suffer from mode collapse, repeatedly generating the same few completions even when many valid answers exist, limiting their diversity across a wide range of tasks. We introduce Group-Aware Policy Optimization (GAPO), a simple extension of the recent and popular Group Relative Policy Optimization (GRPO) that computes rewards over the group as a whole. GAPO enables learning from the group-level properties such as diversity and coverage. We demonstrate GAPO using a frequency-aware reward function that encourages uniform sampling over valid LLM completions, and show that GAPO-trained models produce valid and more diverse model responses. Beyond this setup, GAPO generalizes to open-ended prompts and improves response diversity without compromising accuracy on standard LLM benchmarks (GSM8K, MATH, HumanEval, MMLU-Pro). Our code will be made publicly available.
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM)<n>Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs)<n>A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - GAPO: Group Adaptive Policy Optimization for Real-World Code Edit [18.191276089029607]
Group Adaptive Policy Optimization (GAPO) finds an outlier-free highest-density interval (HDI) per prompt and uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation.<n>GAPO robustly handles skewed distributions while remaining plug-and-play and efficient.<n>We validate GAPO on nine instruction-tuned LLMs (3B-14B) using a large internal dataset of 51,844 real-world, history-aware code-editing tasks.
arXiv Detail & Related papers (2025-10-22T03:37:49Z) - Understanding Generative Recommendation with Semantic IDs from a Model-scaling View [57.471604518714535]
Generative Recommendation (GR) tries to unify rich item semantics and collaborative filtering signals.<n>One popular modern approach is to use semantic IDs (SIDs) to represent items in an autoregressive user interaction sequence modeling setup.<n>We show that SID-based GR shows significant bottlenecks while scaling up the model.<n>We revisit another GR paradigm that directly uses large language models (LLMs) as recommenders.
arXiv Detail & Related papers (2025-09-29T21:24:17Z) - GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning [57.888537648437115]
Group Context Optimization (GroupCoOp) is a simple and effective debiased fine-tuning algorithm.<n>It enhances the group robustness of fine-tuned vision-language models (VLMs)<n>GroupCoOp achieved the best results on five benchmarks across five CLIP architectures.
arXiv Detail & Related papers (2025-09-28T09:54:30Z) - Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs [77.22973302887435]
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs)<n>We present mmGRPO, a simple multi-module of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories.<n>We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks.
arXiv Detail & Related papers (2025-08-06T17:28:31Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [106.98018881499362]
We introduce GEPA (Genetic-Pareto), a prompt that thoroughly incorporates natural language to learn high-level rules from trial and error.<n>GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems.<n>It can often turn even just a few rollouts into a large quality gain.
arXiv Detail & Related papers (2025-07-25T17:42:32Z) - The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations [2.6470894980840525]
Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation.<n>We investigate under which conditions language models can perform these strategies correctly based on zero-shot learning.<n>We show that performance starts to deteriorate when considering more than 100 ratings.<n>We conclude that future research should include group complexity as a factor in GRS evaluation.
arXiv Detail & Related papers (2025-05-08T07:43:01Z) - Group Preference Optimization: Few-Shot Alignment of Large Language Models [28.464834028110538]
Group Preference Optimization steers language models to preferences of individual groups in a few-shot manner.
We empirically validate the efficacy of GPO through rigorous evaluations using large language models with varied sizes.
Our results demonstrate that GPO not only aligns models more accurately but also requires fewer group-specific preferences, and less training and inference computing resources.
arXiv Detail & Related papers (2023-10-17T18:41:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.