FuguReport

DPRM: A Plug-in Doob h transform-induced Token-Ordering Module for Diffusion Language Models

Authors: Dake Bu, Wei Huang, Andi Han, Hau-San Wong, Qingfu Zhang, Taiji Suzuki, Atsushi Nitanda
Affiliations: The University of Sydney / A*STAR / Nanyang Technological University / City University of Hong Kong / RIKEN / The University of Tokyo / The Institute of Statistical Mathematics
Categories: Method / Token Ordering / Doob h-transform-based ordering module, Method / Diffusion Models / Token ordering in diffusion language models, Theory / Sample Complexity / Optimization under extractable assumptions
License: CC BY 4.0

Abstract Overview

This paper introduces DPRM (Doob h-transform Process Reward Model), a plug-in token-ordering module for diffusion language models. It modifies only the ordering policy and leaves the host model architecture, denoising objective, and supervision unchanged. DPRM begins with confidence-based progressive ordering (aligned between training and inference) and gradually transitions to Doob h-transform-inspired, process-reward-guided ordering using online bucketized reward estimates and shortlist-based Soft-BoN reweighting. The theoretical analysis characterizes the exact DPRM policy as a reward-tilted Gibbs reveal law, proves O(1/N) convergence for the Soft-BoN approximation, establishes online tracking guarantees at empirical-Bernstein rates, and shows sample-complexity advantages under stated assumptions. Experiments across seven host settings (natural-language pretraining, reasoning post-training, test-time scaling, and scientific discrete diffusion tasks on protein, single-cell, molecular, and DNA data) demonstrate improvements on multiple benchmarks along with domain-specific trade-offs.
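
For orientation, the generic textbook form of a Doob h-transform and of a reward-tilted Gibbs law can be written as below. This is a sketch under stated assumptions, not the paper's exact statement: the symbols P, h, \pi_0, V, and \lambda are illustrative names that do not appear in this summary.

```latex
% Doob h-transform of a Markov kernel P by a positive function h
% (generic definition; P and h are assumed names):
\[
  P^{h}(x, \mathrm{d}y) \;=\; \frac{P(x, \mathrm{d}y)\, h(y)}{(Ph)(x)},
  \qquad (Ph)(x) \;=\; \int P(x, \mathrm{d}y)\, h(y).
\]
% Choosing h = exp(V / \lambda) for a value-like function V and a
% temperature \lambda > 0 tilts a base reveal law \pi_0 over
% candidate positions a into a Gibbs form:
\[
  \pi^{\star}(a \mid s) \;=\;
  \frac{\pi_0(a \mid s)\, \exp\!\bigl(V(s, a) / \lambda\bigr)}
       {\sum_{a'} \pi_0(a' \mid s)\, \exp\!\bigl(V(s, a') / \lambda\bigr)}.
\]
```

In this reading, \pi_0 would play the role of the confidence-based ordering and V the role of the process-reward estimate. As \lambda grows, the tilted law falls back to \pi_0.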

Novelty

The main novelty is a Doob h-transform-inspired token-ordering controller for diffusion language models that can be inserted into existing systems without changing the host model or training objective. A distinctive aspect is the staged design: it starts from train–test-aligned confidence ordering and transitions to online process-reward guidance via bucketized reward estimates and Soft-BoN reweighting, accompanied by theoretical guarantees for the exact reward-tilted Gibbs policy, its O(1/N) Soft-BoN approximation, online tracking convergence, and sample-complexity separation under stated assumptions.
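
The relationship between a reward-tilted Gibbs law and a Soft-BoN approximation can be illustrated with a minimal sketch. It uses the generic soft best-of-N construction (draw N candidates from a base law, then pick one by a softmax over their rewards), which may differ from the paper's stagewise procedure. The function names, the temperature parameter, and the toy shortlist values are all assumptions made for illustration.

```python
import math
import random


def gibbs_tilt(base_probs, rewards, temperature):
    """Exact tilted law: p_i proportional to base_probs[i] * exp(rewards[i] / temperature)."""
    shift = max(rewards)  # subtract the max reward for numerical stability
    weights = [p * math.exp((r - shift) / temperature)
               for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_bon_sample(base_probs, rewards, temperature, n, rng):
    """Generic soft best-of-N: draw n candidates from the base law, then
    select one of them with probability proportional to exp(reward / temperature)."""
    indices = list(range(len(base_probs)))
    candidates = rng.choices(indices, weights=base_probs, k=n)
    shift = max(rewards[c] for c in candidates)
    weights = [math.exp((rewards[c] - shift) / temperature) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy shortlist of two positions (hypothetical values, not from the paper).
base = [0.5, 0.5]               # stand-in for a confidence-based base law
rewards = [0.0, math.log(3.0)]  # stand-in for bucket-level reward estimates
target = gibbs_tilt(base, rewards, temperature=1.0)

rng = random.Random(0)
draws = [soft_bon_sample(base, rewards, 1.0, n=64, rng=rng) for _ in range(4000)]
empirical = draws.count(1) / len(draws)
```

With n = 1 the sampler reduces to the base law, and as n grows the selection frequencies approach `gibbs_tilt`. The O(1/N) rate itself is the paper's result and is not demonstrated by this sketch.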

Results

In natural-language settings, DPRM-PUMA improves the GSM8K validation mean from 29.34 to 34.27, and DMPO-DPRM improves MATH Hard from 44.3 to 47.9 and Countdown Hard from 29.6 to 33.4. DPRM-Prism improves voted GSM8K accuracy from 82.41 to 83.85, at the cost of a higher number of function evaluations (NFE), which rises from 609 to 1,071. In scientific domains, results are more mixed. DPRM-DCM shows strong gains in token recovery (63.97% to 75.92%) and zero-expression accuracy (78.39% to 99.90%), while protein, molecular, and DNA experiments show that ordering-aware variants can improve selected metrics (e.g., forward-folding RMSD, linker validity, HepG2 scores) but do not uniformly dominate every quality measure.
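
The size of each reported change, and the compute overhead of the test-time-scaling result, can be read off with a few lines of arithmetic. The figures are copied from the paragraph above. The dictionary keys and helper name are illustrative only.

```python
# (before, after) pairs copied from the results paragraph.
reported = {
    "GSM8K validation mean (DPRM-PUMA)": (29.34, 34.27),
    "MATH Hard (DMPO-DPRM)": (44.3, 47.9),
    "Countdown Hard (DMPO-DPRM)": (29.6, 33.4),
    "Voted GSM8K accuracy (DPRM-Prism)": (82.41, 83.85),
    "Token recovery % (DPRM-DCM)": (63.97, 75.92),
    "Zero-expression accuracy % (DPRM-DCM)": (78.39, 99.90),
}


def absolute_gain(before, after):
    """Difference in the metric's own units, rounded to two decimals."""
    return round(after - before, 2)


gains = {name: absolute_gain(*pair) for name, pair in reported.items()}

# NFE overhead of DPRM-Prism relative to its baseline.
nfe_ratio = 1071 / 609

for name, gain in gains.items():
    print(f"{name}: +{gain}")
print(f"NFE ratio: {nfe_ratio:.2f}x")
```

The DPRM-Prism gain of 1.44 on voted GSM8K accuracy therefore comes with roughly 1.76 times as many function evaluations.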

Key Points

  1. DPRM is a plug-in token-ordering module that preserves the host architecture, loss, and data pipeline while changing only the ordering controller, transitioning from confidence-based ordering to online reward-guided ordering as bucket-level reward estimates become reliable.
  2. The theoretical framework characterizes DPRM as a reward-tilted Gibbs reveal law, proves O(1/N) convergence for the stagewise Soft-BoN approximation, establishes online tracking at empirical-Bernstein rates (up to bias from bucket coarsening, warmup, and nonstationarity), and shows sample-complexity advantages under tractable but indirectly validated assumptions.
  3. Across seven host settings, DPRM improves multiple natural-language reasoning benchmarks and yields strong gains on single-cell gene-expression diffusion, while protein, molecular, and DNA experiments reveal domain-specific trade-offs rather than universal gains on every metric.
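
The idea in points 1 and 2, switching from confidence-based to reward-guided ordering once bucket-level estimates are reliable, can be sketched with a running per-bucket estimator and an empirical-Bernstein confidence radius. This is a hypothetical illustration, not the paper's controller. The class name, the reliability threshold, and the hard switch are assumptions. Rewards are assumed to lie in [0, 1], and the radius follows the standard Maurer-Pontil empirical-Bernstein form.

```python
import math
from dataclasses import dataclass


@dataclass
class BucketStats:
    """Running mean and variance of rewards observed in one bucket (Welford update)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, reward):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def variance(self):
        """Unbiased sample variance (0.0 until two samples are seen)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def eb_radius(self, delta=0.05):
        """Empirical-Bernstein radius for rewards in [0, 1] (assumed range)."""
        if self.n < 2:
            return math.inf
        log_term = math.log(2.0 / delta)
        return (math.sqrt(2.0 * self.variance() * log_term / self.n)
                + 7.0 * log_term / (3.0 * (self.n - 1)))


def ordering_score(confidence, stats, threshold=0.2):
    """Hypothetical gate: use the model's confidence until the bucket's
    reward estimate is tight enough, then use the estimated reward."""
    if stats.eb_radius() > threshold:
        return confidence
    return stats.mean


# Usage: a bucket that alternates between rewards 0 and 1.
bucket = BucketStats()
for r in [0.0, 1.0, 0.0, 1.0]:
    bucket.update(r)
early_score = ordering_score(0.9, bucket)  # radius still wide: confidence is used
for i in range(2000):
    bucket.update(float(i % 2))
late_score = ordering_score(0.9, bucket)   # radius now tight: reward mean is used
```

A hard threshold keeps the sketch short. A gradual transition, as the summary describes, could instead interpolate between the two scores as the radius shrinks.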
