Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
- URL: http://arxiv.org/abs/2511.18159v1
- Date: Sat, 22 Nov 2025 19:04:47 GMT
- Title: Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models
- Authors: Mengni Jia, Mengyu Zhou, Yihao Liu, Xiaoxi Jiang, Guanjun Jiang,
- Abstract summary: Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs)<n>High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs often diverge after task-specific training.<n>We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise.
- Score: 8.964977926797173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs), but they suffer from inherently much higher training variance. High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs that are competitive at initialization often diverge after task-specific training, with MDMs falling far behind. There has been no theoretical explanation or systematic solution. We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise, while ARMs are only affected by (C). This explains the fundamental training gap. Building on this foundation, we design six variance-reduction methods, including two core methods: (1) P-POTS, a Pareto-optimal t sampler that minimizes training variance by sampling harder t values more often with appropriately smaller update steps, and (2) MIRROR, which uses negatively correlated samples to reduce (A). Experiments show that compared to standard MDM training, our methods improve accuracy by 7-8% on complex reasoning tasks, while simultaneously reducing run-to-run variability to near ARM levels, substantially narrowing the gap with strong ARM baselines; in most settings, even the best baseline runs remain below the worst run of our method.
Related papers
- MDiff4STR: Mask Diffusion Model for Scene Text Recognition [59.79818820650126]
Mask Diffusion Models (MDMs) have emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks.<n>We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency.<n>We propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for Scene Text Recognition.
arXiv Detail & Related papers (2025-12-01T08:57:51Z) - Score-based Membership Inference on Diffusion Models [3.742113529511043]
Membership inference attacks (MIAs) against diffusion models have emerged as a pressing privacy concern.<n>We present a theoretical and empirical study of score-based MIAs, focusing on the predicted noise vectors that diffusion models learn to approximate.<n>We show that the expected denoiser output points toward a kernel-weighted local mean of nearby training samples, such that its norm encodes proximity to the training set and thereby reveals membership.
arXiv Detail & Related papers (2025-09-29T16:28:55Z) - MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models [28.79185891706149]
Diffusion language models suffer from a key discrepancy between training and inference.<n>We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property diffusion.<n>Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs.
arXiv Detail & Related papers (2025-08-18T17:58:13Z) - Improved Diffusion-based Generative Model with Better Adversarial Robustness [65.38540020916432]
Diffusion Probabilistic Models (DPMs) have achieved significant success in generative tasks.<n>During the denoising process, the input data distributions differ between the training and inference stages.
arXiv Detail & Related papers (2025-02-24T12:29:16Z) - Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions [3.347388046213879]
We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models.<n>SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels.<n>Our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD)
arXiv Detail & Related papers (2025-02-13T18:57:20Z) - Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z) - Adaptive Training Meets Progressive Scaling: Elevating Efficiency in Diffusion Models [52.1809084559048]
We propose a novel two-stage divide-and-conquer training strategy termed TDC Training.<n>It groups timesteps based on task similarity and difficulty, assigning highly customized denoising models to each group, thereby enhancing the performance of diffusion models.<n>While two-stage training avoids the need to train each model separately, the total training cost is even lower than training a single unified denoising model.
arXiv Detail & Related papers (2023-12-20T03:32:58Z) - One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion
Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z) - Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood [64.95663299945171]
Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming.
There exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models.
We propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs.
arXiv Detail & Related papers (2023-09-10T22:05:24Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.