Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
- URL: http://arxiv.org/abs/2505.10446v2
- Date: Wed, 21 May 2025 01:44:47 GMT
- Title: Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
- Authors: Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
- Abstract summary: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We show that DCoLT-reinforced Diffusion Language Models (DLMs) outperform other DLMs trained by SFT or RL.
- Score: 32.424686185300374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, whose concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model, LLaDA, and find that the order in which tokens are predicted and unmasked plays an essential role in optimizing its RL action, which results from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, and +19.5% on GSM8K, MATH, MBPP, and HumanEval, respectively.
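The ranking-based unmasking policy described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is an assumption-laden illustration, not the authors' released code: it samples a token-unmasking order from a Plackett-Luce distribution over per-position scores via the Gumbel-top-k trick, computes the log-probability of that order, and applies a REINFORCE-style update driven by an outcome reward. The scoring head, hidden-state shapes, and the toy reward are hypothetical stand-ins; in DCoLT the reward would come from the correctness of the final decoded answer.

```python
# Minimal sketch of a Plackett-Luce unmasking policy with an outcome-based
# policy-gradient update (illustrative assumptions throughout; not the paper's code).
import torch

def plackett_luce_sample(scores: torch.Tensor):
    """Sample an unmasking order for the masked positions and return its log-prob.

    scores: (L,) unnormalized per-position scores from a scoring head.
    """
    # Gumbel-top-k trick: perturbing scores with Gumbel noise and sorting
    # yields an exact sample from the Plackett-Luce distribution.
    u = torch.rand_like(scores).clamp(1e-6, 1 - 1e-6)
    gumbel = -torch.log(-torch.log(u))
    order = torch.argsort(scores + gumbel, descending=True)

    # log P(order) = sum_k [ s_{o_k} - logsumexp(s_{o_k}, ..., s_{o_L}) ],
    # differentiable w.r.t. the scores (the sampled order is treated as fixed).
    remaining = scores[order]
    log_prob = torch.zeros((), dtype=scores.dtype)
    for k in range(len(order)):
        log_prob = log_prob + remaining[k] - torch.logsumexp(remaining[k:], dim=0)
    return order, log_prob


# Toy usage: outcome-based REINFORCE over sampled unmasking orders.
torch.manual_seed(0)
score_head = torch.nn.Linear(16, 1)            # stand-in for the scoring head (assumption)
optimizer = torch.optim.Adam(score_head.parameters(), lr=1e-3)

hidden = torch.randn(8, 16)                     # fake hidden states for 8 masked tokens
scores = score_head(hidden).squeeze(-1)         # (8,) per-position unmasking scores

order, log_prob = plackett_luce_sample(scores)
reward = 1.0 if order[0].item() == 0 else 0.0   # toy stand-in for final-answer correctness

loss = -reward * log_prob                       # reinforce orders that earn reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("sampled order:", order.tolist(), "reward:", reward)
```

In practice the same log-probability term would be accumulated over every reverse-diffusion step of a full generation, so the outcome reward shapes the entire unmasking trajectory rather than a single step.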
Related papers
- Discrete Diffusion Trajectory Alignment via Stepwise Decomposition [70.9024656666945]
We propose a novel preference optimization method for masked discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-07-07T09:52:56Z) - Posterior Transition Modeling for Unsupervised Diffusion-Based Speech Enhancement [26.937216751657697]
We explore unsupervised speech enhancement using diffusion models as expressive generative priors for clean speech. Existing approaches guide the reverse diffusion process using noisy speech through an approximate, noise-perturbed likelihood score. We propose two alternative algorithms that directly model the conditional reverse transition distribution of diffusion states.
arXiv Detail & Related papers (2025-07-03T07:42:02Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA- in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [31.531278643184656]
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). We propose d1, a framework to adapt pre-trained dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.
arXiv Detail & Related papers (2025-04-16T16:08:45Z) - Large Language Diffusion Models [77.02553707673418]
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z) - Theoretical Benefit and Limitation of Diffusion Language Model [47.579673047639126]
Diffusion language models have emerged as a promising approach for text generation. We present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM). Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs.
arXiv Detail & Related papers (2025-02-13T18:59:47Z) - Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning [43.74071631716718]
We show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution.
We propose a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models.
arXiv Detail & Related papers (2024-07-29T15:36:42Z) - Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Learning to Reach Goals via Diffusion [16.344212996721346]
We present a novel perspective on goal-conditioned reinforcement learning by framing it within the context of denoising diffusion models.
We then learn a goal-conditioned policy to reverse these deviations, analogous to the score function.
This approach, which we call Merlin, can reach specified goals from arbitrary initial states without learning a separate value function.
arXiv Detail & Related papers (2023-10-04T00:47:02Z) - Efficient Diffusion Policies for Offline Reinforcement Learning [85.73757789282212]
Diffusion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model.
We propose efficient diffusion policy (EDP) to overcome these two challenges.
EDP constructs actions from corrupted ones at training to avoid running the sampling chain.
arXiv Detail & Related papers (2023-05-31T17:55:21Z) - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)