Fugu-MT Paper Translation (Abstract): DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning

Paper abstract: DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.02212v1
Date: Thu, 02 Oct 2025 16:57:24 GMT
Status: Translation complete
Last updated in system: 2025-10-03 16:59:21.231861
Title: DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Title (reference translation): DiFFPO: Training Diffusion LLMs to Reason Fast and Furious via Reinforcement Learning
Authors: Hanyang Zhao, Dawen Liang, Wenpin Tang, David Yao, Nathan Kallus
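Abstract summary: We propose a unified framework for training masked diffusion large language models (dLLMs) to reason better. We first unify existing baseline approaches by training surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. Via RL, we incentivize the natural multi-token prediction capability of dLLMs by adaptively allocating an inference threshold for each prompt.
Reference score (attention score computed independently by this site): 37.20873499361773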
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose DiFFPO, Diffusion Fast and Furious Policy Optimization, a unified framework for training masked diffusion large language models (dLLMs) to reason not only better (furious), but also faster via reinforcement learning (RL). We first unify the existing baseline approach such as d1 by proposing to train surrogate policies via off-policy RL, whose likelihood is much more tractable as an approximation to the true dLLM policy. This naturally motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, which leads to generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction of joint training efficient samplers/controllers of dLLMs policy. Via RL, we incentivize dLLMs' natural multi-token prediction capabilities by letting the model learn to adaptively allocate an inference threshold for each prompt. By jointly training the sampler, we yield better accuracies with lower number of function evaluations (NFEs) compared to training the model only, obtaining the best performance in improving the Pareto frontier of the inference-time compute of dLLMs. We showcase the effectiveness of our pipeline by training open source large diffusion language models over benchmark math and planning tasks.
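Abstract (reference translation): We propose DiFFPO (Diffusion Fast and Furious Policy Optimization), a unified framework for training masked diffusion large language models (dLLMs) to reason both better and faster via reinforcement learning (RL). First, we unify existing baseline approaches such as d1 by training surrogate policies via off-policy RL; the likelihood of a surrogate policy is much more tractable as an approximation to the true dLLM policy. This motivates a more accurate and informative two-stage likelihood approximation combined with importance sampling correction, yielding generalized RL algorithms with better sample efficiency and superior task performance. Second, we propose a new direction: jointly training efficient samplers/controllers for the dLLM policy. Via RL, we incentivize the natural multi-token prediction capability of dLLMs by having the model learn to adaptively allocate an inference threshold for each prompt. Jointly training the sampler gives higher accuracy with fewer function evaluations (NFEs) than training the model alone, and achieves the best improvement of the Pareto frontier of dLLM inference-time compute. We demonstrate the effectiveness of the pipeline by training open-source large diffusion language models on benchmark math and planning tasks.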
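Illustrative sketch (not from the paper): the abstract does not say how the inference threshold is used during sampling. One plausible reading, which is an assumption here, is confidence-threshold parallel unmasking: after each forward pass, every masked position whose predicted confidence reaches a threshold tau is filled in at once, so a lower tau means fewer function evaluations (NFEs). The names `MASK`, `toy_model`, and `threshold_decode` are hypothetical, the "model" is a random stand-in, and tau is a fixed argument rather than something learned per prompt.

```python
import random

MASK = None  # hypothetical marker for a position that is still masked


def toy_model(seq, rng):
    """Stand-in for one dLLM forward pass (assumption, not a real model).

    Returns {position: (token, confidence)} for every masked position.
    """
    return {
        i: (rng.randrange(100), rng.random())
        for i, tok in enumerate(seq)
        if tok is MASK
    }


def threshold_decode(length, tau, seed=0):
    """Decode a fully masked sequence with confidence-threshold unmasking.

    Each step unmasks all positions with confidence >= tau, and always the
    single most confident position so that decoding terminates.
    Returns (decoded sequence, number of function evaluations).
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    nfe = 0
    while any(tok is MASK for tok in seq):
        preds = toy_model(seq, rng)
        nfe += 1  # one forward pass = one function evaluation
        best = max(preds, key=lambda i: preds[i][1])
        for i, (tok, conf) in preds.items():
            if conf >= tau or i == best:
                seq[i] = tok
    return seq, nfe


if __name__ == "__main__":
    for tau in (0.0, 0.5, 0.9, 1.1):
        _, nfe = threshold_decode(16, tau)
        print(f"tau={tau}: NFEs={nfe}")
```

In this toy setup, tau = 0.0 unmasks everything in a single pass, while any tau above 1.0 falls back to one token per pass, so NFEs equal the sequence length. The trade-off the abstract describes lies between these extremes, with accuracy (not modeled here) as the other axis.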

Related papers