RIV: Recursive Introspection Mask Diffusion Vision Language Model
- URL: http://arxiv.org/abs/2509.23625v1
- Date: Sun, 28 Sep 2025 04:01:46 GMT
- Title: RIV: Recursive Introspection Mask Diffusion Vision Language Model
- Authors: YuQian Li, Limeng Qiao, Lin Ma
- Abstract summary: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. We propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability.
- Score: 10.955541881166782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mask Diffusion-based Vision Language Models (MDVLMs) have achieved remarkable progress in multimodal understanding tasks. However, these models are unable to correct errors in generated tokens, meaning they lack self-correction capability. In this paper, we propose Recursive Introspection Mask Diffusion Vision Language Model (RIV), which equips the model with self-correction ability through two novel mechanisms. The first is Introspection Training, where an Introspection Model is introduced to identify errors within generated sequences. Introspection Training enables the model to detect not only grammatical and spelling mistakes, but more importantly, logical errors. The second is Recursive Inference. Beginning with the standard unmasking step, the learned Introspection Model helps to identify errors in the output sequence and remask them. This alternating ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) process is repeated recursively until reliable results are obtained. Experimental results on multiple benchmarks demonstrate that the proposed RIV achieves state-of-the-art performance, outperforming most existing MDVLMs.
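The recursive inference loop the abstract describes ($\text{unmask}\rightarrow\text{introspection}\rightarrow\text{remask}$) can be sketched with toy stand-ins. The sketch below is illustrative only: `unmask_fn` and `introspect_fn` are hypothetical placeholders for the paper's mask-diffusion denoiser and learned Introspection Model, not its actual networks.

```python
MASK = "<mask>"

def recursive_inference(tokens, unmask_fn, introspect_fn, max_rounds=5):
    """Alternate unmask -> introspect -> remask until no errors are flagged."""
    for _ in range(max_rounds):
        # Unmask: fill every masked position with the model's current guess.
        tokens = unmask_fn(tokens)
        # Introspect: collect indices the checker believes are wrong.
        bad = introspect_fn(tokens)
        if not bad:
            return tokens  # reliable result, stop recursing
        # Remask only the flagged positions and try again.
        tokens = [MASK if i in bad else t for i, t in enumerate(tokens)]
    return tokens

def make_toy_models(target):
    """Toy denoiser/checker pair: the first pass makes one deliberate error,
    which introspection flags and a later pass corrects."""
    attempts = {"n": 0}
    def unmask_fn(tokens):
        attempts["n"] += 1
        out = []
        for i, t in enumerate(tokens):
            if t != MASK:
                out.append(t)          # already-decoded tokens are kept
            elif attempts["n"] == 1 and i == 1:
                out.append("dog")      # deliberate first-pass mistake
            else:
                out.append(target[i])
        return out
    def introspect_fn(tokens):
        return {i for i, t in enumerate(tokens) if t != target[i]}
    return unmask_fn, introspect_fn

target = ["the", "cat", "sat"]
unmask_fn, introspect_fn = make_toy_models(target)
result = recursive_inference([MASK] * len(target), unmask_fn, introspect_fn)
print(result)
```

In this toy run the first unmasking pass produces a wrong token, introspection remasks it, and the second pass recovers the full target sequence, which is the alternating behavior the paper attributes to RIV.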
Related papers
- MMErroR: A Benchmark for Erroneous Reasoning in Vision-Language Models [29.830224745428566]
We present MMErroR, a benchmark of 2,013 samples, each embedding a single coherent reasoning error. Unlike existing benchmarks that focus on answer correctness, MMErroR targets a process-level, error-centric evaluation. We evaluate 20 advanced Vision-Language Models; even the best model (Gemini-3.0-Pro) correctly classifies the error in only 66.47% of cases.
arXiv Detail & Related papers (2026-01-06T17:45:26Z) - WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens [69.97021957331326]
We propose Noisy Query Tokens, which learn a distributed representation space between the VLM and Diffusion Model via end-to-end optimization. We also introduce a VAE branch with linear projection to recover fine-grained image details.
arXiv Detail & Related papers (2025-12-02T09:02:20Z) - FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges [85.24983823102262]
We propose a structured methodology for evaluating text-to-image (T2I) models and vision language models (VLMs). We test whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our findings suggest that current metrics are insufficient to capture these nuanced errors.
arXiv Detail & Related papers (2025-12-01T19:46:03Z) - Don't Settle Too Early: Self-Reflective Remasking for Diffusion Language Models [40.902681492117786]
RemeDi is a mask-based DLM that predicts token distributions and per-token confidence scores at each step. We design a remask-aware pipeline to train this ability, including supervised fine-tuning that teaches the model to detect and remask incorrect tokens. Experiments show that RemeDi achieves state-of-the-art results among open-source DLMs on multiple datasets.
arXiv Detail & Related papers (2025-09-28T05:39:49Z) - Understanding and Enhancing Mask-Based Pretraining towards Universal Representations [13.262679155411599]
Mask-based pretraining has become a cornerstone of modern large-scale models across language, vision, and biology. We show that the behavior of mask-based pretraining can be directly characterized by test risk in high-dimensional minimum-norm ("ridge-less") linear regression. We propose Randomly Random Mask Auto (R$^2$MAE), which enforces capturing multi-scale features from data.
arXiv Detail & Related papers (2025-09-25T22:08:25Z) - SUDER: Self-Improving Unified Large Multimodal Models for Understanding and Generation with Dual Self-Rewards [55.99492656542475]
We propose SUDER (Self-improving Unified LMMs with Dual sElf-Rewards), a framework reinforcing the understanding and generation capabilities of LMMs.
arXiv Detail & Related papers (2025-06-09T17:38:45Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Beyond Image-Text Matching: Verb Understanding in Multimodal Transformers Using Guided Masking [0.4543820534430524]
This work introduces an alternative probing strategy called guided masking.
The proposed approach ablates different modalities using masking and assesses the model's ability to predict the masked word with high accuracy.
We show that guided masking on ViLBERT, LXMERT, UNITER, and VisualBERT can predict the correct verb with high accuracy.
arXiv Detail & Related papers (2024-01-29T21:22:23Z) - Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for evaluating syntactic capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - Masked Language Model Based Textual Adversarial Example Detection [14.734863175424797]
Adversarial attacks are a serious threat to the reliable deployment of machine learning models in safety-critical applications.
We propose a novel textual adversarial example detection method, namely Masked Language Model-based Detection (MLMD).
arXiv Detail & Related papers (2023-04-18T06:52:14Z) - uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers [23.343006562849126]
We propose a framework named uChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from various flexible masking operations, we propose a Confusionset-guided masking strategy to fine-tune the masked language model.
arXiv Detail & Related papers (2022-09-15T05:57:12Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trained MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.