Fugu-MT Paper Translation (Abstract): DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Paper overview: DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

arxiv url: http://arxiv.org/abs/2604.13902v1
Date: Wed, 15 Apr 2026 14:12:10 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.574276
Title: DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
Authors: Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie
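Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has brought significant advances in the reasoning capabilities of Large Language Models (LLMs). This paper thoroughly analyzes the exploration-exploitation dilemma of extremely hard and easy samples during training and proposes a new fine-grained trade-off mechanism.
Reference score (independently computed attention score): 87.58233482504308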
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration and exploitation trade-off remains a critical challenge. In this paper, we fully analyze the exploration and exploitation dilemma of extremely hard and easy samples during the training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into distinct exploration (high perplexity) and exploitation (low perplexity) subspaces, thereby mining fine-grained samples requiring exploration-exploitation trade-off. Subsequently, we propose a bidirectional reward allocation mechanism with a minimum impact on verification rewards to implement perplexity-guided exploration and exploitation, enabling more stable policy optimization. Finally, we have evaluated our method on two mainstream tasks: mathematical reasoning and function calling, and experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance by fine-grained exploration-exploitation trade-off.
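The abstract does not give formulas, so the following is only a minimal sketch of the two ideas it names: splitting sampled responses into high-perplexity (exploration) and low-perplexity (exploitation) subspaces, and adding a small reward adjustment that leaves the verification reward dominant. The function names, the fixed threshold, and the `explore_bonus` / `exploit_bonus` parameters are all hypothetical illustrations, not the authors' method. Only the perplexity definition (the exponential of the negative mean token log-probability) is standard.

```python
import math


def perplexity(token_logprobs):
    """Standard sequence perplexity: exp of the negative mean token log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(ppls, threshold):
    """Partition sample indices into (exploration, exploitation) subspaces.

    ASSUMPTION: a single fixed threshold; the paper may use a different rule.
    """
    explore = [i for i, p in enumerate(ppls) if p > threshold]
    exploit = [i for i, p in enumerate(ppls) if p <= threshold]
    return explore, exploit


def shaped_rewards(verify_rewards, ppls, threshold,
                   explore_bonus=0.1, exploit_bonus=0.05):
    """Add a small per-subspace adjustment to 0/1 verification rewards.

    ASSUMPTION: the bonuses are placeholders. Keeping their absolute values
    below 0.5 guarantees a verified-correct response (reward 1) still
    outranks an incorrect one (reward 0) after shaping.
    """
    if max(abs(explore_bonus), abs(exploit_bonus)) >= 0.5:
        raise ValueError("bonuses must stay below 0.5 in absolute value")
    explore, _ = split_by_perplexity(ppls, threshold)
    explore_set = set(explore)
    return [
        r + (explore_bonus if i in explore_set else exploit_bonus)
        for i, r in enumerate(verify_rewards)
    ]


if __name__ == "__main__":
    # Three sampled responses, each a list of token log-probabilities.
    samples = [
        [math.log(0.5), math.log(0.5)],
        [math.log(0.25), math.log(0.25)],
        [math.log(0.8), math.log(0.8)],
    ]
    ppls = [perplexity(s) for s in samples]
    print(split_by_perplexity(ppls, threshold=2.0))
    print(shaped_rewards([1.0, 0.0, 1.0], ppls, threshold=2.0))
```

Bounding the adjustment is one simple way to read "minimum impact on verification rewards": the shaping term can reorder responses that share the same verification outcome, but it cannot make an incorrect response outrank a correct one.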

Related papers