Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
- URL: http://arxiv.org/abs/2510.02880v1
- Date: Fri, 03 Oct 2025 10:36:24 GMT
- Title: Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models
- Authors: Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye
- Abstract summary: We introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion. MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality.
- Score: 40.82263997290613
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, which confounds reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion.
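For readers unfamiliar with GRPO, the following is a minimal sketch of the generic group-relative advantage and clipped importance-ratio objective that the abstract refers to. It is not MaskGRPO's estimator: the function names are hypothetical, and the per-completion log-probabilities are simply taken as given inputs, whereas obtaining a usable estimate of them for a discrete diffusion model is exactly the difficulty the abstract describes.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for the same prompt.

    Hypothetical helper: each advantage is the reward minus the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for a single completion.

    logp_new / logp_old are assumed to be available log-probabilities under
    the current and the rollout policy; the importance ratio is their
    exponentiated difference, clipped to [1 - clip, 1 + clip].
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# Example: four completions of one prompt with binary rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
objective = mean(
    clipped_surrogate(lp_new, lp_old, adv)
    for lp_new, lp_old, adv in zip([-1.0, -2.0, -1.5, -2.5],
                                   [-1.1, -1.9, -1.5, -2.4],
                                   advantages)
)
```

The clipping keeps a single completion with a large importance ratio from dominating the update, which is why a reliable ratio estimate matters for stable training.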
Related papers
- Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models [6.350443894942629]
Multimodal Weight Allocation Module (MWAM) is a plug-and-play component that dynamically re-balances the contribution of each branch during training. MWAM delivers consistent performance gains across a wide range of tasks and modality combinations.
arXiv Detail & Related papers (2026-02-26T05:51:41Z)
- Co-GRPO: Co-Optimized Group Relative Policy Optimization for Masked Diffusion Model [74.99242687133408]
Masked Diffusion Models (MDMs) have shown promising potential across vision, language, and cross-modal generation. We introduce Co-GRPO, which reformulates MDM generation as a unified Markov Decision Process (MDP) that jointly incorporates both the model and the inference schedule.
arXiv Detail & Related papers (2025-12-25T12:06:04Z)
- Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment [54.17386822940477]
We introduce PromptLoop, a plug-and-play reinforcement learning framework that incorporates latent feedback into step-wise prompt refinement. This design achieves a structural analogy to the Diffusion RL approach, while retaining the flexibility and generality of prompt-based alignment.
arXiv Detail & Related papers (2025-10-01T02:18:58Z)
- Reinforcement Learning with Discrete Diffusion Policies for Combinatorial Action Spaces [57.466101098183884]
Reinforcement learning (RL) struggles to scale to the large, combinatorial action spaces common in many real-world problems. This paper introduces a novel framework for training discrete diffusion models as highly effective policies in complex settings.
arXiv Detail & Related papers (2025-09-26T21:53:36Z)
- SPREAD: Sampling-based Pareto front Refinement via Efficient Adaptive Diffusion [0.8594140167290097]
SPREAD is a generative framework based on Denoising Diffusion Probabilistic Models (DDPMs). It learns a conditional diffusion process over points sampled from the decision space. It refines candidates via a sampling scheme that uses an adaptive multiple gradient descent-inspired update for fast convergence.
arXiv Detail & Related papers (2025-09-25T12:09:37Z)
- Diffusion-Based Symbolic Regression [20.941908494137806]
Diffusion has emerged as a powerful framework for generative modeling, achieving remarkable success in applications such as image and audio synthesis. We propose a novel diffusion-based approach for symbolic regression. We construct a random mask-based diffusion and denoising process to generate diverse and high-quality equations.
arXiv Detail & Related papers (2025-05-30T16:39:29Z)
- MMaDA: Multimodal Large Diffusion Language Models [61.13527224215318]
We introduce MMaDA, a novel class of multimodal diffusion foundation models. It is designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation.
arXiv Detail & Related papers (2025-05-21T17:59:05Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs by multimodal knowledge. We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Learning Dynamic Representations via An Optimally-Weighted Maximum Mean Discrepancy Optimization Framework for Continual Learning [16.10753846850319]
Continual learning allows models to persistently acquire and retain information. Catastrophic forgetting can severely impair model performance. We introduce a novel framework termed Optimally-Weighted Maximum Mean Discrepancy (OWMMD), which imposes penalties on representation alterations.
arXiv Detail & Related papers (2025-01-21T13:33:45Z)
- Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Learning to Rebalance Multi-Modal Optimization by Adaptively Masking Subnetworks [13.065212096469537]
We propose a novel importance sampling-based, element-wise joint optimization method, called Adaptively Mask Subnetworks Considering Modal Significance (AMSS). Specifically, we incorporate mutual information rates to determine the modal significance and employ non-uniform adaptive sampling to select foreground subnetworks from each modality for parameter updates. Building upon theoretical insights, we further enhance the multi-modal mask subnetwork strategy using unbiased estimation, referred to as AMSS+.
arXiv Detail & Related papers (2024-04-12T09:22:24Z)