MDiff4STR: Mask Diffusion Model for Scene Text Recognition
- URL: http://arxiv.org/abs/2512.01422v1
- Date: Mon, 01 Dec 2025 08:57:51 GMT
- Authors: Yongkun Du, Miaomiao Zhao, Songlin Fan, Zhineng Chen, Caiyan Jia, Yu-Gang Jiang
- Abstract summary: Mask Diffusion Models (MDMs) have emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks. We show that vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. We propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for Scene Text Recognition.
- Score: 59.79818820650126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mask Diffusion Models (MDMs) have recently emerged as a promising alternative to auto-regressive models (ARMs) for vision-language tasks, owing to their flexible balance of efficiency and accuracy. In this paper, for the first time, we introduce MDMs into the Scene Text Recognition (STR) task. We show that a vanilla MDM lags behind ARMs in terms of accuracy, although it improves recognition efficiency. To bridge this gap, we propose MDiff4STR, a Mask Diffusion model enhanced with two key improvement strategies tailored for STR. Specifically, we identify two key challenges in applying MDMs to STR: a noising gap between training and inference, and overconfident predictions during inference. Both significantly hinder the performance of MDMs. To mitigate the first issue, we develop six noising strategies that better align training with inference behavior. For the second, we propose a token-replacement noise mechanism that provides a non-mask noise type, encouraging the model to reconsider and revise overly confident but incorrect predictions. We conduct extensive evaluations of MDiff4STR on both standard and challenging STR benchmarks, covering diverse scenarios including irregular, artistic, occluded, and Chinese text, as well as settings with and without pretraining. Across these settings, MDiff4STR consistently outperforms popular STR models, surpassing state-of-the-art ARMs in accuracy, while maintaining fast inference with only three denoising steps. Code: https://github.com/Topdu/OpenOCR.
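The two noise types named in the abstract (mask noise and token-replacement noise) can be illustrated with a small sketch. This is not the authors' implementation and does not reproduce any of the six noising strategies. The `MASK` symbol, the `CHARSET` vocabulary, the `add_noise` function, and the two ratio parameters are all hypothetical names chosen for illustration. The sketch only shows how a text label might be corrupted so that some positions are hidden behind a mask while others carry a plausible but wrong character that a model would have to detect and revise.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper
CHARSET = "abcdefghijklmnopqrstuvwxyz0123456789"  # hypothetical vocabulary


def add_noise(label, mask_ratio, replace_ratio, rng):
    """Corrupt a label with two noise types (illustrative assumption only).

    Masked positions become MASK; replaced positions receive a random
    character that differs from the original one.
    """
    n = len(label)
    positions = list(range(n))
    rng.shuffle(positions)

    # Decide how many positions receive each noise type.
    n_mask = round(n * mask_ratio)
    n_replace = min(round(n * replace_ratio), n - n_mask)
    mask_pos = set(positions[:n_mask])
    replace_pos = set(positions[n_mask:n_mask + n_replace])

    noised = []
    for i, ch in enumerate(label):
        if i in mask_pos:
            noised.append(MASK)
        elif i in replace_pos:
            # Non-mask noise: a wrong but valid character.
            noised.append(rng.choice([c for c in CHARSET if c != ch]))
        else:
            noised.append(ch)
    return noised, mask_pos, replace_pos


rng = random.Random(0)
noised, mask_pos, replace_pos = add_noise("street", 0.5, 0.2, rng)
print(noised)
```

In this sketch a model trained on such inputs would be asked to predict the original character at every position, so replaced positions give it practice at correcting confident-looking but wrong tokens rather than only filling in masks.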
Related papers
- Relaxing Positional Alignment in Masked Diffusion Language Models [6.511565218210195]
Masked diffusion language models (MDLMs) have emerged as a promising alternative to dominant autoregressive approaches.
We show that strict positional prediction makes MDLM decoding highly sensitive to token misalignment.
We apply this approach to the widely used MDLM model and conduct experiments on five open-ended text generation benchmarks.
arXiv Detail & Related papers (2026-01-30T13:09:21Z)
- SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal.
Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.
We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z)
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed [76.49335677120031]
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation.
We study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy.
arXiv Detail & Related papers (2025-12-16T04:12:17Z)
- Masked Diffusion Models are Secretly Learned-Order Autoregressive Models [21.17429712617749]
We show that Masked Diffusion Models can identify and optimize for a decoding order during training.
We prove that the MDM objective decomposes precisely into weighted auto-regressive losses over these orders.
arXiv Detail & Related papers (2025-11-24T14:17:56Z)
- Bringing Stability to Diffusion: Decomposing and Reducing Variance of Training Masked Diffusion Models [8.964977926797173]
Masked diffusion models (MDMs) are a promising alternative to autoregressive models (ARMs).
High variance leads to noisier gradient estimates and unstable optimization, so even equally strong pretrained MDMs and ARMs often diverge after task-specific training.
We derive the first decomposition of MDM training variance into three sources: (A) masking pattern noise, (B) masking rate noise, and (C) data noise.
arXiv Detail & Related papers (2025-11-22T19:04:47Z)
- MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models [28.79185891706149]
Diffusion language models suffer from a key discrepancy between training and inference.
We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property of diffusion.
Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs.
arXiv Detail & Related papers (2025-08-18T17:58:13Z)
- Anchored Diffusion Language Model [39.17770765212062]
We introduce the Anchored Diffusion Language Model (ADLM), a novel framework that predicts distributions over important tokens via an anchor network.
ADLM significantly improves test perplexity on LM1B and OpenWebText, achieving up to 25.4% gains over prior DLMs.
It also surpasses AR models in MAUVE score, which marks the first time a DLM generates better human-like text than an AR model.
arXiv Detail & Related papers (2025-05-24T01:34:14Z)
- Model Inversion Attacks Through Target-Specific Conditional Diffusion Models [54.69008212790426]
Model inversion attacks (MIAs) aim to reconstruct private images from a target classifier's training set, thereby raising privacy concerns in AI applications.
Previous GAN-based MIAs tend to suffer from inferior generative fidelity due to GAN's inherent flaws and biased optimization within latent space.
We propose Diffusion-based Model Inversion (Diff-MI) attacks to alleviate these issues.
arXiv Detail & Related papers (2024-07-16T06:38:49Z)
- Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting [133.55037976429088]
We investigate the adversarial robustness of vision transformers equipped with BERT pretraining (e.g., BEiT, MAE).
A surprising observation is that MAE has significantly worse adversarial robustness than other BERT pretraining methods.
We propose a simple yet effective way to boost the adversarial robustness of MAE.
arXiv Detail & Related papers (2023-08-20T16:27:17Z)
- Black-box Adversarial Attacks against Dense Retrieval Models: A Multi-view Contrastive Learning Method [115.29382166356478]
We introduce the adversarial retrieval attack (AREA) task.
It is meant to trick DR models into retrieving a target document that is outside the initial set of candidate documents retrieved by the DR model.
We find that the promising results previously reported on attacking NRMs do not generalize to DR models.
We propose to formalize attacks on DR models as a contrastive learning problem in a multi-view representation space.
arXiv Detail & Related papers (2023-08-19T00:24:59Z)
- Beyond Pretrained Features: Noisy Image Modeling Provides Adversarial Defense [52.66971714830943]
Masked image modeling (MIM) has become a prevailing framework for self-supervised visual representation learning.
In this paper, we investigate how this powerful self-supervised learning paradigm can provide adversarial robustness to downstream classifiers.
We propose an adversarial defense method, referred to as De3, by exploiting the pretrained decoder for denoising.
arXiv Detail & Related papers (2023-02-02T12:37:24Z)