GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
- URL: http://arxiv.org/abs/2510.23868v1
- Date: Mon, 27 Oct 2025 21:18:19 GMT
- Title: GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
- Authors: Zhichao Wang,
- Abstract summary: GIFT is a novel reinforcement learning framework for aligning LLMs. It minimizes the discrepancy between implicit and explicit reward models, and achieves superior reasoning and alignment performance on mathematical benchmarks.
- Score: 6.07907277934348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: I propose \textbf{G}roup-relative \textbf{I}mplicit \textbf{F}ine \textbf{T}uning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
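The abstract describes the core mechanics concretely enough to sketch: GRPO-style group normalization of rewards over multiple sampled responses, a DPO-style implicit reward from policy/reference log-probabilities, and an MSE loss aligning the two normalized reward sets. The following is a minimal NumPy sketch of that objective for a single prompt's response group; the function names, the `1e-8` stabilizer, and the per-group cancellation comment are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def group_normalize(r):
    # GRPO-style normalization across a group of G responses to one prompt
    return (r - r.mean()) / (r.std() + 1e-8)

def gift_loss(logp_policy, logp_ref, explicit_rewards, beta=1.0):
    """Sketch of the GIFT objective for one prompt's group of G responses.

    logp_policy, logp_ref: log-probabilities of each sampled response
    under the current policy and a frozen reference model (shape [G]).
    explicit_rewards: scalar scores from the explicit reward model.
    """
    # DPO-style implicit reward: beta * log(pi_theta / pi_ref).
    # Any additive per-prompt term (the otherwise intractable one) is
    # constant within the group, so it cancels under mean subtraction.
    implicit = beta * (logp_policy - logp_ref)
    z_imp = group_normalize(implicit)
    z_exp = group_normalize(np.asarray(explicit_rewards, dtype=float))
    # MSE between the jointly normalized reward functions
    return np.mean((z_imp - z_exp) ** 2)
```

Note that because both reward sets are standardized within the group, the loss is invariant to the scale of `beta` and to any constant shift of either reward, which is what lets the normalization eliminate the intractable term the abstract refers to.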
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations [53.61624356747686]
Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution.
arXiv Detail & Related papers (2026-02-01T21:49:30Z) - Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning [60.00161035836637]
Group Relative Policy Optimization has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. We introduce Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. OAR-G achieves comparable gains with negligible computational overhead, both significantly outperforming a strong GRPO baseline.
arXiv Detail & Related papers (2026-01-12T10:48:02Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - AMIR-GRPO: Inducing Implicit Preference Signals into GRPO [15.759757442328388]
Reinforcement learning has become the primary paradigm for aligning large language models on complex reasoning tasks. GRPO is widely used in large-scale post-training but faces structural limitations in reasoning-heavy settings. AMIR-GRPO augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings.
arXiv Detail & Related papers (2026-01-07T07:22:58Z) - FedLoDrop: Federated LoRA with Dropout for Generalized LLM Fine-tuning [65.26899091946417]
Fine-tuning large language models (LLMs) is crucial for adapting general-purpose models to specific tasks. This paper proposes Federated LoRA with Dropout (FedLoDrop), a new framework that applies dropout to the rows and columns of the trainable matrix in Federated LoRA.
arXiv Detail & Related papers (2025-10-14T02:40:45Z) - FlowRL: Matching Reward Distributions for LLM Reasoning [69.88820066093798]
We propose FlowRL: matching the full reward distribution via flow balancing instead of maximizing rewards in large language model (LLM) reinforcement learning (RL). We transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution.
arXiv Detail & Related papers (2025-09-18T17:56:36Z) - A Principled Loss Function for Direct Language Model Alignment [0.0]
We propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific finite value for the logits, which is dictated by the underlying reward, rather than its difference. This inherent stability prevents reward hacking and leads to more effective alignment.
arXiv Detail & Related papers (2025-08-10T01:56:58Z) - Explicit Preference Optimization: No Need for an Implicit Reward Model [18.225409932618657]
Direct preference optimization (DPO) and its offshoots circumvent the need for a separate reward training step. We show that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive artifacts.
arXiv Detail & Related papers (2025-06-09T07:11:01Z) - Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [10.617854230082896]
Group Relative Policy Optimization was introduced and used recently for promoting reasoning in LLMs under verifiable (binary) rewards. We analyze variants that differ in reward normalization (mean-only vs mean + variance) and in how they regularize updates using KL divergence.
arXiv Detail & Related papers (2025-03-09T14:36:45Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.