d2: Improved Techniques for Training Reasoning Diffusion Language Models
- URL: http://arxiv.org/abs/2509.21474v2
- Date: Mon, 29 Sep 2025 01:33:05 GMT
- Title: d2: Improved Techniques for Training Reasoning Diffusion Language Models
- Authors: Guanghan Wang, Yair Schiff, Gilad Turok, Volodymyr Kuleshov
- Abstract summary: We introduce d2, a reasoning framework tailored for masked diffusion language models (DLMs). Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on properties of masking to accurately estimate the likelihoods of sampling trajectories. Our estimators trade off computation for approximation accuracy in an analytically tractable manner, and are particularly effective for DLMs that support any-order likelihood estimation. We characterize and study this property in popular DLMs and show that it is key for efficient diffusion-based reasoning. Empirically, d2 significantly improves over previous diffusion reasoning frameworks using only RL (without relying on supervised fine-tuning), and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500).
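The abstract's two ingredients, masking-based likelihood estimation for sampled trajectories and a policy gradient built on top of it, are easiest to see in code. The sketch below is a minimal, hypothetical PyTorch illustration, not d2's implementation: the ToyMaskedDLM, the uniform mask-ratio draw with 1/t reweighting (a standard Monte-Carlo estimator in the masked-diffusion literature), and the mean-reward baseline are all assumptions standing in for the paper's actual model and estimator family.

```python
# Minimal sketch (illustrative assumptions, not d2's actual algorithm):
# (1) Monte-Carlo estimate of a completion's log-likelihood under a masked
#     DLM via random masking patterns, and (2) a REINFORCE-style update that
#     weights this estimate by a task reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN = 100, 99, 16  # toy vocabulary; id 99 is [MASK]

class ToyMaskedDLM(nn.Module):
    """Stand-in for a masked diffusion LM: one forward pass predicts every
    position's token from the partially masked sequence."""
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d)
        self.mix = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * d, VOCAB)

    def forward(self, tokens):                      # (B, L) -> (B, L, VOCAB)
        h, _ = self.mix(self.emb(tokens))
        return self.out(h)

def loglik_estimate(model, x, n_draws=4):
    """Estimate log p(x): draw a mask ratio t ~ U(0, 1], mask tokens i.i.d.
    with probability t, score the masked positions, reweight by 1/t.
    More draws buy lower variance at proportionally more compute."""
    B, L = x.shape
    total = torch.zeros(B)
    for _ in range(n_draws):
        t = torch.rand(B, 1).clamp_min(1e-3)
        mask = torch.rand(B, L) < t
        mask[~mask.any(dim=1), 0] = True            # keep >= 1 masked position
        noised = torch.where(mask, torch.full_like(x, MASK_ID), x)
        logp = F.log_softmax(model(noised), dim=-1)
        logp = logp.gather(-1, x.unsqueeze(-1)).squeeze(-1)   # (B, L)
        total = total + (logp * mask.float()).sum(dim=1) / t.squeeze(1)
    return total / n_draws

def policy_gradient_step(model, opt, completions, rewards):
    """REINFORCE with a mean-reward baseline: raise the estimated likelihood
    of sampled completions that beat the batch-average reward."""
    adv = rewards - rewards.mean()
    loss = -(adv * loglik_estimate(model, completions)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = ToyMaskedDLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
completions = torch.randint(0, MASK_ID, (8, SEQ_LEN))  # pretend: sampled answers
rewards = torch.rand(8)                                # pretend: task rewards
print(policy_gradient_step(model, opt, completions, rewards))
```

Raising n_draws in loglik_estimate lowers the estimator's variance at proportionally higher cost, which is the kind of analytically tractable computation-for-accuracy trade-off the abstract describes.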
Related papers
- RADAR: Revealing Asymmetric Development of Abilities in MLLM Pre-training
Pre-trained Multi-modal Large Language Models (MLLMs) provide a knowledge-rich foundation for post-training. Current evaluation relies on testing after supervised fine-tuning, which introduces laborious additional training and autoregressive decoding costs. We propose RADAR, an efficient ability-centric evaluation framework for Revealing Asymmetric Development of Abilities in MLLM pRe-training.
arXiv Detail & Related papers (2026-02-13T12:56:31Z)
- Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed
Diffusion language models (dLMs) have emerged as a promising paradigm that enables parallel, non-autoregressive generation. We study AR-to-dLM conversion to transform pretrained AR models into efficient dLMs that excel in speed while preserving AR models' task accuracy.
arXiv Detail & Related papers (2025-12-16T04:12:17Z)
- How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices
Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm. Yet, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods.
arXiv Detail & Related papers (2025-10-21T10:00:32Z)
- DLM-One: Diffusion Language Models for One-Step Sequence Generation
DLM-One is a score-distillation-based framework for one-step sequence generation with continuous diffusion language models. We investigate whether DLM-One can achieve substantial gains in sampling efficiency for language modeling.
arXiv Detail & Related papers (2025-05-30T22:42:23Z)
- MDPO: Multi-Granularity Direct Preference Optimization for Mathematical Reasoning
We propose Multi-Granularity Direct Preference Optimization (MDPO), a method for improving the mathematical reasoning of Large Language Models (LLMs). We conduct experiments on the open-source models Qwen2 and Llama3, achieving improvements of 1.7% and 1.2% on the GSM8K dataset, and 2.3% and 1.2% on the MATH dataset. We also provide a simple pipeline for constructing MDPO training data that requires no manual annotation. (A generic DPO-loss sketch, for context on the underlying objective, appears after this list.)
arXiv Detail & Related papers (2025-05-30T08:42:14Z)
- Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We show that DCoLT-reinforced Diffusion Language Models (DLMs) outperform other DLMs trained by SFT or RL.
arXiv Detail & Related papers (2025-05-15T16:06:32Z)
- d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). We propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.
arXiv Detail & Related papers (2025-04-16T16:08:45Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the Chain-of-Action-Thought (COAT) reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv Detail & Related papers (2025-02-04T17:26:58Z)
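For context on the preference-optimization objective underlying the MDPO entry above, here is the standard sequence-level Direct Preference Optimization (DPO) loss (Rafailov et al., 2023) as a hedged PyTorch sketch; MDPO's multi-granularity extension is not specified in the abstract and is not reproduced here.

```python
# The standard sequence-level DPO objective (Rafailov et al., 2023), shown
# for context only; MDPO's multi-granularity variant is not reproduced here.
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin)), where each
    margin is log p(chosen response) - log p(rejected response)."""
    policy_margin = pol_chosen - pol_rejected
    ref_margin = ref_chosen - ref_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: each entry is a sequence log-likelihood (sum of per-token
# log-probs) under the trained policy or a frozen reference model.
pol_c = torch.tensor([-12.3, -9.8], requires_grad=True)
pol_r = torch.tensor([-11.1, -10.4], requires_grad=True)
ref_c = torch.tensor([-12.0, -10.0])
ref_r = torch.tensor([-11.5, -10.2])
print(dpo_loss(pol_c, pol_r, ref_c, ref_r))
```

The beta parameter controls how strongly the policy is pushed away from the frozen reference model's preference margins.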