GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
- URL: http://arxiv.org/abs/2511.15256v1
- Date: Wed, 19 Nov 2025 09:19:39 GMT
- Title: GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning
- Authors: Yanchen Xu, Ziheng Jiao, Hongyuan Zhang, Xuelong Li
- Abstract summary: We propose Group Relative Policy Optimization for Representation Model (GRPO-RM). Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs). A specialized reward function is designed to accommodate the properties of representation models.
- Score: 52.16150076582931
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Group Relative Policy Optimization (GRPO), a reinforcement learning method used to fine-tune large language models (LLMs), has proved its effectiveness in practical applications such as DeepSeek-R1. This raises the question of whether GRPO can be generalized to representation learning models. In this paper, we propose Group Relative Policy Optimization for Representation Model (GRPO-RM), and investigate the performance of GRPO-like policies in post-training representation models. Specifically, our method establishes a predefined output set to functionally replace token sequence sampling in LLMs, thereby generating an output group, which is essential for the probability-driven optimization of GRPO. In addition, a specialized reward function is designed to accommodate the properties of representation models. Extensive experiments are conducted on various real-world datasets to validate the effectiveness of our proposed method.
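The core GRPO mechanics the abstract relies on can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate outputs, rewards, and log-probabilities below are hypothetical placeholders standing in for GRPO-RM's predefined output set; only the group-relative advantage normalization (reward minus group mean, divided by group standard deviation) follows the standard GRPO formulation.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each reward by the group's mean
    and standard deviation, so each update is relative to the group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# In an LLM, GRPO samples a group of G responses per prompt. GRPO-RM
# instead scores a predefined output set; here we mimic that with four
# hypothetical candidates, each with a reward and a log-probability
# assigned by the model being fine-tuned.
rewards = [1.0, 0.2, 0.6, 0.0]        # hypothetical per-candidate rewards
log_probs = [-0.5, -1.2, -0.9, -2.0]  # hypothetical model log-probabilities

advantages = group_relative_advantages(rewards)

# Policy-gradient surrogate: raise the probability of candidates whose
# reward beats the group average, lower it for the rest.
loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
```

Because the advantages are centered on the group mean, they sum to (approximately) zero: the optimization only redistributes probability mass among the candidates in the group rather than uniformly inflating it.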
Related papers
- Demystifying Group Relative Policy Optimization: Its Policy Gradient is a U-Statistic [12.256817975993128]
Group Relative Policy Optimization (GRPO) is a core methodological component of DeepSeekMath and DeepSeek-R1. This paper provides a unified framework to understand GRPO through the lens of classical U-statistics.
arXiv Detail & Related papers (2026-03-01T15:56:43Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration. We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z) - A First-Order Logic-Based Alternative to Reward Models in RLHF [0.0]
Reinforcement Learning from Human Feedback plays a crucial role in aligning large language models with human values and preferences. Existing approaches rely heavily on reward models to guide language models toward human-aligned behaviors. We propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling.
arXiv Detail & Related papers (2025-12-16T05:15:17Z) - Understanding Generative Recommendation with Semantic IDs from a Model-scaling View [57.471604518714535]
Generative Recommendation (GR) tries to unify rich item semantics and collaborative filtering signals. One popular modern approach is to use semantic IDs (SIDs) to represent items in an autoregressive user interaction sequence modeling setup. We show that SID-based GR exhibits significant bottlenecks when scaling up the model. We revisit another GR paradigm that directly uses large language models (LLMs) as recommenders.
arXiv Detail & Related papers (2025-09-29T21:24:17Z) - GRPO is Secretly a Process Reward Model [5.637496960655903]
We show that the GRPO RL algorithm induces a non-trivial process reward model under real-world conditions. We propose a simple modification to the algorithm to mitigate this defect. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO.
arXiv Detail & Related papers (2025-09-25T13:40:36Z) - Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine whether current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z) - Stepwise Guided Policy Optimization: Coloring your Incorrect Reasoning in GRPO [22.00487909203855]
Group Relative Policy Optimization fails to update a policy when all responses within a group are incorrect. This limitation underscores a key gap between artificial and human intelligence. We introduce a simple framework that mitigates the all-negative-sample issue by incorporating response diversity within groups.
arXiv Detail & Related papers (2025-05-16T18:02:05Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - Diverse Policy Optimization for Structured Action Space [59.361076277997704]
We propose Diverse Policy Optimization (DPO) to model the policies in structured action space as energy-based models (EBMs).
A novel and powerful generative model, GFlowNet, is introduced as the efficient, diverse EBM-based policy sampler.
Experiments on ATSC and Battle benchmarks demonstrate that DPO can efficiently discover surprisingly diverse policies.
arXiv Detail & Related papers (2023-02-23T10:48:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.