Fugu-MT Paper Translation (Summary): Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Paper summary: Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

arxiv url: http://arxiv.org/abs/2606.08501v1
Date: Sun, 07 Jun 2026 07:59:55 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.16285
Title: Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
Title (reference translation): Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
Authors: Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo, Xueyang Fu, Yang Cao, Wei Zhai, Zheng-Jun Zha
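Abstract summary: Process Aligned Policy Optimization (PAPO) is a new framework that aligns the RL update with the dLLM's generative trajectory. PAPO uses Step-Aware Process Rewards (SPR) to turn sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) to replay authentic trajectories at high-uncertainty steps.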
Reference score (independently computed attention score): 90.09182925511317
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.
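Abstract (reference translation): Reinforcement learning (RL) holds great promise for improving the reasoning capabilities of diffusion large language models (dLLMs). Progress is fundamentally limited, however, by two misalignments between the authentic generation trajectory and the gradient update process. (i) Process-reward misalignment: sparse terminal rewards are assigned indiscriminately to every intermediate step of the generation process, so they provide no discriminative credit assignment. (ii) State-trajectory misalignment: policy updates are often diverted toward artificial, out-of-trajectory states, which wastes gradients on less informative samples. To address these limitations, the authors introduce Process Aligned Policy Optimization (PAPO), a new framework that aligns the RL update with the dLLM's generative trajectory through Step-Aware Process Rewards (SPR), which convert sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR), which replays authentic trajectories at high-uncertainty steps. On four benchmarks, PAPO significantly outperforms the baselines, with gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown, and 16.1% on Sudoku.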
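
The abstract names two ideas, step-wise credit derived from a terminal reward and replay at high-uncertainty steps, but gives no formulas for either. The toy Python sketch below is only an illustration of those two ideas under loud assumptions: the function names, the use of Shannon entropy as the uncertainty measure, the top-k selection rule, and the entropy-proportional split of the reward are all hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_k_uncertain_steps(step_entropies, k):
    """Indices of the k highest-entropy steps, returned in trajectory order.

    Assumption: "high-uncertainty" is modeled as a simple top-k rule.
    """
    ranked = sorted(
        range(len(step_entropies)),
        key=lambda i: step_entropies[i],
        reverse=True,
    )
    return sorted(ranked[:k])


def spread_terminal_reward(terminal_reward, step_weights):
    """Split one terminal reward across steps in proportion to the weights.

    Assumption: proportional weighting is a stand-in for step-wise credit;
    it falls back to a uniform split when every weight is zero.
    """
    total = sum(step_weights)
    if total == 0:
        return [terminal_reward / len(step_weights)] * len(step_weights)
    return [terminal_reward * w / total for w in step_weights]


# Toy trajectory: one predictive distribution per generation step.
step_distributions = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
]
step_entropies = [entropy(p) for p in step_distributions]

# Steps that a replay mechanism might revisit under the top-k assumption.
replay_steps = top_k_uncertain_steps(step_entropies, k=2)

# Dense per-step credit that still sums to the single terminal reward.
step_rewards = spread_terminal_reward(1.0, step_entropies)

print(replay_steps)
print([round(r, 3) for r in step_rewards])
```

In this toy setup the per-step rewards always sum to the original terminal reward, so the total learning signal is unchanged and only its distribution over steps differs.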

Related papers