MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization
- URL: http://arxiv.org/abs/2510.21473v1
- Date: Fri, 24 Oct 2025 13:57:59 GMT
- Title: MRO: Enhancing Reasoning in Diffusion Language Models via Multi-Reward Optimization
- Authors: Chenglong Wang, Yang Gan, Hang Zhou, Chi Hu, Yongyu Mu, Kai Song, Murun Yang, Bei Li, Chunliang Zhang, Tongran Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao,
- Abstract summary: diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs)<n>DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases.<n>We propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process.
- Score: 66.82303841930752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion language models (DLMs) have presented a promising alternative to traditional autoregressive large language models (LLMs). However, DLMs still lag behind LLMs in reasoning performance, especially as the number of denoising steps decreases. Our analysis reveals that this shortcoming arises primarily from the independent generation of masked tokens across denoising steps, which fails to capture the token correlation. In this paper, we define two types of token correlation: intra-sequence correlation and inter-sequence correlation, and demonstrate that enhancing these correlations improves reasoning performance. To this end, we propose a Multi-Reward Optimization (MRO) approach, which encourages DLMs to consider the token correlation during the denoising process. More specifically, our MRO approach leverages test-time scaling, reject sampling, and reinforcement learning to directly optimize the token correlation with multiple elaborate rewards. Additionally, we introduce group step and importance sampling strategies to mitigate reward variance and enhance sampling efficiency. Through extensive experiments, we demonstrate that MRO not only improves reasoning performance but also achieves significant sampling speedups while maintaining high performance on reasoning benchmarks.
Related papers
- Recursive Think-Answer Process for LLMs and VLMs [54.52289112197118]
We propose an efficient Recursive Think-Answer Process (R-TAP)<n>R-TAP enables models to engage in iterative reasoning cycles and generate more accurate answers.<n>We show that R-TAP-enhanced models consistently outperform conventional single-pass methods.
arXiv Detail & Related papers (2026-03-02T17:20:10Z) - Rectifying LLM Thought from Lens of Optimization [48.98086817378953]
Long chain-of-thought (CoT) prompting enables thorough exploration and deliberation.<n>Despite advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors.<n>We introduce RePro, a novel approach to refine LLM reasoning during post-training.
arXiv Detail & Related papers (2025-12-01T17:41:08Z) - A Comprehensive Study on Visual Token Redundancy for Discrete Diffusion-based Multimodal Large Language Models [85.30893355216486]
We study how visual token redundancy evolves with different dMLLM architectures and tasks.<n>Our study reveals that visual redundancy emerges only in from-scratch dMLLMs while handling long-answer tasks.<n>Layer-skipping is promising for accelerating AR-to-diffusion dMLLMs, whereas progressive or late-step pruning is more effective for from-scratch dMLLMs.
arXiv Detail & Related papers (2025-11-19T04:13:36Z) - HatLLM: Hierarchical Attention Masking for Enhanced Collaborative Modeling in LLM-based Recommendation [17.271853114690902]
HatLLM is a hierarchical attention masking strategy for large language models (LLMs) for sequential recommendation.<n>HatLLM achieves significant performance gains (9.13% on average) over existing LLM-based methods.
arXiv Detail & Related papers (2025-10-13T03:05:03Z) - Beyond Next-Token Prediction: A Performance Characterization of Diffusion versus Autoregressive Language Models [82.87985794856803]
Large Language Models (LLMs) have achieved state-of-the-art performance on a broad range of Natural Language Processing (NLP) tasks.<n>Recently, Diffusion Language Models (DLMs) have emerged as a promising alternative architecture.
arXiv Detail & Related papers (2025-10-05T10:50:52Z) - WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging.<n>WSM provides a unified theoretical foundation for emulating various decay strategies.<n>Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z) - Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst [42.40884882220895]
We introduce textitSelf-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and improve performance through self-training.<n>Our proposed SRLM achieves an average absolute improvement of more than $+2.5$ points across five reasoning tasks.
arXiv Detail & Related papers (2025-05-20T09:21:26Z) - d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [31.531278643184656]
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefits from online reinforcement learning (RL)<n>We propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL.<n>We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM.
arXiv Detail & Related papers (2025-04-16T16:08:45Z) - Don't Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models [32.71672086718058]
Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs)<n>We observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs.<n>We propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens.
arXiv Detail & Related papers (2025-03-14T07:46:33Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.<n>We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.<n>We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Causal Prompting: Debiasing Large Language Model Prompting based on Front-Door Adjustment [32.12998469814097]
A novel causal prompting method based on front-door adjustment is proposed to effectively mitigate Large Language Models (LLMs) biases.<n> Experimental results show that the proposed causal prompting approach achieves excellent performance across seven natural language processing datasets.
arXiv Detail & Related papers (2024-03-05T07:47:34Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.