Fugu-MT Paper Translation (Abstract): MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

Paper summary: MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models

arxiv url: http://arxiv.org/abs/2508.13148v1
Date: Mon, 18 Aug 2025 17:58:13 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:11.52448
Title: MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
Authors: Haoyu He, Katrin Renz, Yong Cao, Andreas Geiger
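Abstract summary: Diffusion language models suffer from a key discrepancy between training and inference. This paper proposes Masked Diffusion Policy Optimization (MDPO), which exploits the Markov property of diffusion. The findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs.
Reference score (independently calculated attention score): 32.21165055067441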
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, masked diffusion language models (MDLMs) progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training because tokens are masked at random. Although this discrepancy can lead to suboptimal performance, it has been largely overlooked by previous work, and closing the gap between the two stages remains an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose a novel method, Masked Diffusion Policy Optimization (MDPO), that exploits the Markov property of the diffusion process and explicitly trains the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained with the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, which we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: https://github.com/autonomousvision/mdpo. Project Page: https://cli212.github.io/MDPO/.
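
The discrepancy the abstract describes can be illustrated with a toy sketch. This is not the authors' implementation. It only contrasts training-style random masking with an inference-style loop that starts fully masked and reveals tokens step by step. The names `random_mask`, `progressive_unmask`, and `toy_predict`, the `<mask>` placeholder, and the "reveal the most confident positions first" schedule are all assumptions made for illustration; the abstract does not specify the actual schedule, the RL objective, or how RCR works.

```python
import random

MASK = "<mask>"  # hypothetical mask placeholder


def random_mask(tokens, mask_ratio, rng):
    """Training-style corruption: mask a random subset of positions,
    independent of the order in which tokens would be generated."""
    n_mask = round(mask_ratio * len(tokens))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def progressive_unmask(length, num_steps, predict):
    """Inference-style decoding: start fully masked and, at each step,
    reveal the masked positions the predictor is most confident about.

    `predict(seq, i)` is a hypothetical stand-in for a model call and
    must return a (token, confidence) pair for position i.
    """
    seq = [MASK] * length
    trajectory = [list(seq)]
    for step in range(num_steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        # Spread the remaining masked positions evenly over the remaining steps.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        proposals = {i: predict(seq, i) for i in masked}
        chosen = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in chosen:
            seq[i] = proposals[i][0]
        trajectory.append(list(seq))
    return trajectory


# Toy data and a toy predictor (assumptions, not a real model).
TARGET = ["2", "+", "3", "=", "5", "."]


def toy_predict(seq, i):
    # Always proposes the target token; confidence is higher for earlier positions.
    return TARGET[i], 1.0 / (1 + i)


rng = random.Random(0)
training_view = random_mask(TARGET, mask_ratio=0.5, rng=rng)
trajectory = progressive_unmask(len(TARGET), num_steps=3, predict=toy_predict)
mask_counts = [state.count(MASK) for state in trajectory]
```

In this toy setting the inference trajectory follows a structured sequence of partially masked states, while `random_mask` produces an arbitrary masking pattern. Training under the same progressive schedule used at inference, as the abstract proposes, is meant to close that gap.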


Related papers