Related papers: MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

URL: http://arxiv.org/abs/2511.09611v3
Date: Tue, 18 Nov 2025 06:18:34 GMT
Title: MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Authors: Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
Abstract summary: We propose a new benchmark designed to evaluate both text and image output modalities. This performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. We propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images.
Score: 86.82285754460491
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel

Related papers

Auto-Regressive Masked Diffusion Models [9.239507801466322]
Masked diffusion models (MDMs) have emerged as a promising approach for language modeling. They face a performance gap compared to autoregressive models (ARMs) and require more training iterations. We present the Auto-Regressive Masked Diffusion model, which unifies the training efficiency of autoregressive models with the parallel generation capabilities of diffusion-based models.
arXiv Detail & Related papers (2026-01-23T18:42:30Z)
Parallel Diffusion Solver via Residual Dirichlet Policy Optimization [88.7827307535107]
Diffusion models (DMs) have achieved state-of-the-art generative performance but suffer from high sampling latency due to their sequential denoising nature. Existing solver-based acceleration methods often face significant image quality degradation under a low-dimensional budget. We propose the Ensemble Parallel Direction solver (dubbed as EPD-EPr), a novel ODE solver that mitigates these errors by incorporating multiple gradient parallel evaluations in each step.
arXiv Detail & Related papers (2025-12-28T05:48:55Z)
THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning [25.605096023894834]
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning. Despite recent advances, existing methods struggle with three key challenges. We propose THOR (Tool-Integrated Hierarchical Optimization via RL) to overcome these limitations. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models.
arXiv Detail & Related papers (2025-09-17T07:16:12Z)
Equivariant Image Modeling [17.87342493584631]
Current generative models, such as autoregressive and diffusion approaches, decompose high-dimensional data distribution learning into simpler subtasks. We propose a novel equivariant image modeling framework that inherently aligns optimization targets across subtasks. Our approach achieves performance comparable to state-of-the-art AR models while using fewer computational resources.
arXiv Detail & Related papers (2025-03-24T17:59:57Z)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. We also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations. We propose the Corner-to-Center transformer-based Context Model (C$^3$M) designed to enhance context and latent predictions. In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation [15.411325887412413]
This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM). FSA-CDM introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Experiments are conducted on four benchmark datasets from different domains.
arXiv Detail & Related papers (2023-08-02T13:43:03Z)
End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures. We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPM for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)
Mitigating Modality Collapse in Multimodal VAEs via Impartial Optimization [7.4262579052708535]
We argue that modality collapse is a consequence of conflicting gradients during multimodal VAE training. We show how to detect the sub-graphs in the computational graphs where gradients conflict. We empirically show that our framework significantly improves the reconstruction performance, conditional generation, and coherence of the latent space across modalities.
arXiv Detail & Related papers (2022-06-09T13:29:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.