Fugu-MT Paper Translation (Abstract): DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

Paper overview: DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

arxiv url: http://arxiv.org/abs/2604.24357v1
Date: Mon, 27 Apr 2026 11:50:26 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.964214
Title: DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models
Title (reference translation): DPRM: A Transform-Based Token-Ordering Module for Diffusion Language Models
Authors: Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda
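Abstract summary: This paper introduces DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform process-reward-guided ordering. Under tractable optimization assumptions, DPRM yields a sample-complexity advantage over random and confidence-only ordering.
Reference score (independently computed attention metric): 76.12556589212666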
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion language models generate without a fixed left-to-right order, making token ordering a central algorithmic choice: which tokens should be revealed, retained, revised or verified at each step? Existing systems mainly use random masking or confidence-driven ordering. Random masking creates train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. We introduce DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts to Doob h-transform process-reward-guided ordering through online estimates. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling, and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics while not uniformly dominating the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module for improving it. Code is available at https://github.com/DakeBU/DPRM-DLLM.
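Abstract (reference translation): Diffusion language models generate without a fixed left-to-right order, which makes token ordering a central algorithmic choice. Existing systems mainly use random masking or confidence-driven ordering. Random masking creates a train-test mismatch, while confidence-only rules are efficient but can be myopic and suppress useful exploration. This paper introduces DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. DPRM keeps the host architecture, denoising objective and supervision unchanged, and changes only the ordering policy. It starts from confidence-driven progressive ordering and gradually shifts, through online estimates, to Doob h-transform process-reward-guided ordering. We characterize the exact DPRM policy as a reward-tilted Gibbs reveal law, prove O(1/N) convergence of the stagewise Soft-BoN approximation, and show that the online bucketized controller tracks the exact DPRM score at empirical-Bernstein rates. Under tractable optimization assumptions, DPRM also yields a sample-complexity advantage over random and confidence-only ordering. DPRM improves over confidence-based baselines in pretraining, post-training, test-time scaling and single-cell masked diffusion, with particularly strong gains on harder reasoning subsets. In protein, molecular generation and DNA design, the effect is more multi-objective: ordering-aware variants significantly improve selected structural or fragment-constrained metrics but do not uniformly dominate the host baseline on every quality metric. These results identify token ordering as a fundamental control axis in diffusion language models and establish DPRM as a general-purpose module. Code is available at https://github.com/DakeBU/DPRM-DLLM.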

Related papers