Fugu-MT Paper Translation (Summary): Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

Paper summary: Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step

arxiv url: http://arxiv.org/abs/2509.23924v1
Date: Sun, 28 Sep 2025 15:01:15 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.536069
Title: Taming Masked Diffusion Language Models via Consistency Trajectory Reinforcement Learning with Fewer Decoding Step
Authors: Jingyi Yang, Guanxu Chen, Xuhao Hu, Jing Shao
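Abstract summary: Masked diffusion language models (MDLMs) offer properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. A naive approach is to directly transfer techniques well-established for autoregressive (AR) language models to MDLMs. This paper proposes EOS Early Rejection (EOSER) and the Ascending Step-Size (ASS) decoding scheduler.
Reference score (independently computed attention metric): 28.12392773921128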
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a promising alternative to autoregressive (AR) language models, offering properties such as parallel decoding, flexible generation orders, and the potential for fewer inference steps. Despite these advantages, decoding strategies and reinforcement learning (RL) algorithms tailored for MDLMs remain underexplored. A naive approach is to directly transfer techniques well-established for AR models to MDLMs. However, this raises an immediate question: Is such a naive transfer truly optimal? For example, 1) Block-wise and semi-AR decoding strategies are not employed during the training of MDLMs, so why do they outperform full diffusion-style decoding during inference? 2) Applying RL algorithms designed for AR models directly to MDLMs exhibits a training-inference inconsistency, since MDLM decoding is non-causal (parallel). This results in inconsistencies between the rollout trajectory and the optimization trajectory. To address these challenges, we propose EOS Early Rejection (EOSER) and the Ascending Step-Size (ASS) decoding scheduler, which unlock the potential of MDLMs to perform full diffusion-style decoding, achieving competitive performance with fewer decoding steps. Additionally, we introduce Consistency Trajectory Group Relative Policy Optimization (CJ-GRPO) for taming MDLMs, which emphasizes the consistency between the rollout trajectory and the optimization trajectory, and reduces the optimization errors caused by skip-step optimization. We conduct extensive experiments on reasoning tasks, such as mathematical and planning benchmarks, using LLaDA-8B-Instruct. The results demonstrate that the proposed EOSER and ASS mechanisms, together with CJ-GRPO, hold significant promise for effectively and efficiently taming MDLMs. Code: https://github.com/yjyddq/EOSER-ASS-RL.
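As a rough illustration of what an ascending step-size decoding schedule might look like, the sketch below builds a schedule in which the number of tokens unmasked per step grows geometrically. The abstract does not specify the schedule, so the doubling rule, the function name `ascending_schedule`, and its parameters are assumptions made for illustration only, not the authors' implementation (the linked repository is the reference for that).

```python
from typing import List


def ascending_schedule(gen_length: int, first_step: int = 1, growth: int = 2) -> List[int]:
    """Return how many tokens to unmask at each decoding step.

    Hypothetical helper: step sizes start at `first_step` and are multiplied
    by `growth` after every step. The final step absorbs whatever is left,
    so the sizes are non-decreasing and sum to `gen_length`.
    """
    if gen_length < 0 or first_step < 1 or growth < 1:
        raise ValueError("gen_length must be >= 0; first_step and growth must be >= 1")

    sizes: List[int] = []
    remaining = gen_length
    size = first_step
    while remaining > 0:
        # If taking `size` tokens would leave fewer than `size` behind,
        # finish now so the last step is not smaller than this one.
        take = remaining if remaining - size < size else size
        sizes.append(take)
        remaining -= take
        size *= growth
    return sizes


print(ascending_schedule(16))        # [1, 2, 4, 9]
print(len(ascending_schedule(256)))  # 8
```

Under these assumptions, a 256-token response is decoded in 8 steps, whereas unmasking one token per step would take 256. Setting `growth=1` recovers a uniform schedule, which makes the two easy to compare.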

Related papers