Consensus Group Relative Policy Optimization for Text Generation
- URL: http://arxiv.org/abs/2602.03102v1
- Date: Tue, 03 Feb 2026 04:54:40 GMT
- Title: Consensus Group Relative Policy Optimization for Text Generation
- Authors: Yuki Ichihara, Yuu Jinnai, Kaito Ariu, Eiji Uchibe,
- Abstract summary: C-GRPO distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective.<n>Under ideal conditions, we show that the objective function of C-GRPO is directionally aligned with the gradient of the expected-utility objective.<n>Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO successfully achieves performance comparable to MBR decoding.
- Score: 26.428687964354605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many strong decoding methods for text generation follow a sample-and-rerank paradigm: they draw multiple candidates, score each under a utility (reward) function using consensus across samples, and return the best one. Although effective, these methods incur high computational costs during inference due to repeated sampling and scoring. Prior attempts to amortize inference-time computation typically rely on gold references, teacher labels, or curated preference data, increasing dataset construction effort and the demand for high-fidelity reward models. We propose Consensus Group Relative Policy Optimization (C-GRPO), which distills Minimum Bayes Risk (MBR) decoding into training by formulating the consensus utility as a group-relative objective within GRPO. C-GRPO requires only a utility function and policy samples, without gold references or explicit preference labels. Under ideal conditions, we show that the objective function of C-GRPO is directionally aligned with the gradient of the expected-utility objective underlying MBR decoding, leading to a convergence guarantee. Experiments on machine translation (WMT 2024) and text summarization (XSum) demonstrate that C-GRPO successfully achieves performance comparable to MBR decoding without the associated inference-time overhead, while outperforming reference-free baseline methods.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization [7.961090665261694]
We propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization.<n>By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment.
arXiv Detail & Related papers (2026-03-03T07:45:40Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration.<n>We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z) - Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs [12.75200353208858]
Owen-Shapley Policy Optimization (OSPO) is a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes.<n>Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit.<n> Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines.
arXiv Detail & Related papers (2026-01-13T10:17:46Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - On the Limits of Test-Time Compute: Sequential Reward Filtering for Better Inference [71.09125259964684]
Test-time compute (TTC) has become an increasingly prominent paradigm for enhancing large language models (LLMs)<n>We study reward-filtered sequential inference, a simple procedure that selectively incorporates only high-reward generations into the context.<n>On the theoretical side, we show that reward-filtered sequential inference yields strictly stronger guarantees than standard TTC paradigms.
arXiv Detail & Related papers (2025-12-04T08:21:33Z) - CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning [34.38636514331703]
CLaRa is a unified framework that performs embedding-based compression and joint optimization in a shared continuous space.<n> Experiments show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.
arXiv Detail & Related papers (2025-11-24T00:11:14Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM)<n>Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs)<n>A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data.<n> prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards.<n>We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z) - Improving Consistency in Retrieval-Augmented Systems with Group Similarity Rewards [14.535325886547112]
RAG systems are increasingly deployed in high-stakes domains where users expect outputs to be consistent across semantically equivalent queries.<n>Existing systems often exhibit significant inconsistencies due to variability in both the retriever and generator.<n>We introduce a principled evaluation framework that decomposes RAG consistency into retriever-level, generator-level, and end-to-end components.
arXiv Detail & Related papers (2025-10-05T23:14:13Z) - Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach [2.8626097661711394]
Reinforcement Learning from Human Feedback has achieved notable success in steering models, but is complex and can be unstable.<n>Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade-off certain objectives.<n>We propose a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation.
arXiv Detail & Related papers (2025-03-26T05:50:33Z) - Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning.<n>SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT.<n>SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.