Fugu-MT Paper Translation (Abstract): GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

Paper overview: GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA

arxiv url: http://arxiv.org/abs/2510.23868v1
Date: Mon, 27 Oct 2025 21:18:19 GMT
Status: Translation complete
System update date: 2025-10-29 15:35:36.519367
Title: GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
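Title (reference translation): GIFT: Group-relative Implicit Fine Tuning integrates GRPO with DPO and UNA
Authors: Zhichao Wang
Abstract summary: GIFT is a new reinforcement learning framework for alignment. It minimizes the discrepancy between implicit and explicit reward models. It achieves superior reasoning and alignment performance on mathematical benchmarks.
Reference score (attention score computed independently by this site): 6.07907277934348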
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
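Abstract (reference translation): This paper proposes GIFT (Group-relative Implicit Fine Tuning), a new reinforcement learning framework for aligning LLMs. Rather than directly maximizing cumulative reward as PPO and GRPO do, GIFT minimizes the discrepancy between the implicit and explicit reward models. It brings together three main ideas: (1) GRPO's online multi-response generation and normalization, (2) DPO's implicit reward formulation, and (3) UNA's principle of aligning implicit and explicit rewards. By normalizing the implicit and explicit rewards jointly, GIFT removes an otherwise intractable term that stands in the way of using implicit rewards effectively. The normalization turns the complicated reward-maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT stays on-policy and so keeps its ability to explore. Compared with GRPO, it needs fewer hyperparameters, converges faster, and generalizes better with substantially less overfitting to the training data. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.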
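The abstract describes the loss only in words, so the following is a minimal sketch of the idea rather than the author's implementation. It assumes the "intractable term" is the prompt-dependent log-partition constant that appears in the DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)) + beta * log Z(x); that term is the same for every response in a group sampled from one prompt, so subtracting the group mean cancels it. The function names, the `log_z` argument, and the use of plain floats in place of differentiable log-probabilities are all hypothetical choices made for illustration.

```python
import math


def normalize(values, eps=1e-8):
    """Group-relative normalization: subtract the group mean, divide by the group std."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]


def implicit_rewards(policy_logps, ref_logps, beta=1.0, log_z=0.0):
    """DPO-style implicit reward per response.

    log_z stands in for the prompt-level constant that cannot be computed in
    practice (an assumption of this sketch); it is exposed only to show that
    it has no effect after normalization.
    """
    return [beta * (p - r) + beta * log_z for p, r in zip(policy_logps, ref_logps)]


def gift_style_loss(policy_logps, ref_logps, explicit_rewards, beta=1.0, log_z=0.0):
    """MSE between normalized implicit rewards and normalized explicit rewards."""
    imp = normalize(implicit_rewards(policy_logps, ref_logps, beta, log_z))
    exp = normalize(explicit_rewards)
    return sum((a - b) ** 2 for a, b in zip(imp, exp)) / len(imp)


# One prompt, four sampled responses (made-up numbers).
policy_logps = [-4.0, -6.0, -5.0, -9.0]   # sequence log-probs under the current policy
ref_logps = [-5.0, -5.0, -5.0, -5.0]      # sequence log-probs under the reference policy
explicit = [1.0, 0.0, 1.0, 0.0]           # e.g. correctness scores from a verifier

loss = gift_style_loss(policy_logps, ref_logps, explicit)
loss_shifted = gift_style_loss(policy_logps, ref_logps, explicit, log_z=123.4)
print(loss, loss_shifted)  # the two values agree up to floating-point error
```

Dividing by the group standard deviation also removes any positive scale factor such as beta, which is one plausible reading of the claim that the method needs fewer hyperparameters than GRPO. In actual training the log-probabilities would be differentiable tensors computed from responses sampled on-policy.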
