Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
- URL: http://arxiv.org/abs/2505.10446v2
- Date: Wed, 21 May 2025 01:44:47 GMT
- Title: Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models
- Authors: Zemin Huang, Zhiyang Chen, Zijun Wang, Tiancheng Li, Guo-Jun Qi
- Abstract summary: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We show that DCoLT-reinforced Diffusion Language Models (DLMs) outperform other DLMs trained by SFT or RL.
- Score: 32.424686185300374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Diffusion Chain of Lateral Thought (DCoLT), a reasoning framework for diffusion language models. DCoLT treats each intermediate step in the reverse diffusion process as a latent "thinking" action and optimizes the entire reasoning trajectory to maximize the reward on the correctness of the final answer with outcome-based Reinforcement Learning (RL). Unlike traditional Chain-of-Thought (CoT) methods that follow a causal, linear thinking process, DCoLT allows bidirectional, non-linear reasoning with no strict rule on grammatical correctness amid its intermediate steps of thought. We implement DCoLT on two representative Diffusion Language Models (DLMs). First, we choose SEDD as a representative continuous-time discrete diffusion model, whose concrete score derives a probabilistic policy to maximize the RL reward over the entire sequence of intermediate diffusion steps. We further consider the discrete-time masked diffusion language model, LLaDA, and find that the order in which tokens are predicted and unmasked plays an essential role in optimizing its RL action, which results from the ranking-based Unmasking Policy Module (UPM) defined by the Plackett-Luce model. Experiments on both math and code generation tasks show that using only public data and 16 H800 GPUs, DCoLT-reinforced DLMs outperform other DLMs trained by SFT or RL or even both. Notably, DCoLT-reinforced LLaDA boosts its reasoning accuracy by +9.8%, +5.7%, +11.4%, and +19.5% on GSM8K, MATH, MBPP, and HumanEval, respectively.
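The ranking-based unmasking policy described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is an assumption-laden illustration, not the authors' released code: it samples a token-unmasking order from a Plackett-Luce distribution over per-position scores via the Gumbel-top-k trick, computes the log-probability of that order, and applies a REINFORCE-style update driven by an outcome reward. The scoring head, hidden-state shapes, and the toy reward are hypothetical stand-ins; in DCoLT the reward would come from the correctness of the final decoded answer.

```python
# Minimal sketch of a Plackett-Luce unmasking policy with an outcome-based
# policy-gradient update (illustrative assumptions throughout; not the paper's code).
import torch

def plackett_luce_sample(scores: torch.Tensor):
    """Sample an unmasking order for the masked positions and return its log-prob.

    scores: (L,) unnormalized per-position scores from a scoring head.
    """
    # Gumbel-top-k trick: perturbing scores with Gumbel noise and sorting
    # yields an exact sample from the Plackett-Luce distribution.
    u = torch.rand_like(scores).clamp(1e-6, 1 - 1e-6)
    gumbel = -torch.log(-torch.log(u))
    order = torch.argsort(scores + gumbel, descending=True)

    # log P(order) = sum_k [ s_{o_k} - logsumexp(s_{o_k}, ..., s_{o_L}) ],
    # differentiable w.r.t. the scores (the sampled order is treated as fixed).
    remaining = scores[order]
    log_prob = torch.zeros((), dtype=scores.dtype)
    for k in range(len(order)):
        log_prob = log_prob + remaining[k] - torch.logsumexp(remaining[k:], dim=0)
    return order, log_prob


# Toy usage: outcome-based REINFORCE over sampled unmasking orders.
torch.manual_seed(0)
score_head = torch.nn.Linear(16, 1)            # stand-in for the scoring head (assumption)
optimizer = torch.optim.Adam(score_head.parameters(), lr=1e-3)

hidden = torch.randn(8, 16)                     # fake hidden states for 8 masked tokens
scores = score_head(hidden).squeeze(-1)         # (8,) per-position unmasking scores

order, log_prob = plackett_luce_sample(scores)
reward = 1.0 if order[0].item() == 0 else 0.0   # toy stand-in for final-answer correctness

loss = -reward * log_prob                       # reinforce orders that earn reward
optimizer.zero_grad()
loss.backward()
optimizer.step()
print("sampled order:", order.tolist(), "reward:", reward)
```

In practice the same log-probability term would be accumulated over every reverse-diffusion step of a full generation, so the outcome reward shapes the entire unmasking trajectory rather than a single step.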
Related papers
- Discrete Diffusion Trajectory Alignment via Stepwise Decomposition [70.9024656666945]
We propose a novel preference optimization method for masked discrete diffusion models. Instead of applying the reward on the final output and backpropagating the gradient to the entire discrete denoising process, we decompose the problem into a set of stepwise alignment objectives. Experiments across multiple domains including DNA sequence design, protein inverse folding, and language modeling consistently demonstrate the superiority of our approach.
arXiv Detail & Related papers (2025-07-07T09:52:56Z) - Posterior Transition Modeling for Unsupervised Diffusion-Based Speech Enhancement [26.937216751657697]
We explore unsupervised speech enhancement using diffusion models as expressive generative priors for clean speech. Existing approaches guide the reverse diffusion process using noisy speech through an approximate, noise-perturbed likelihood score. We propose two alternative algorithms that directly model the conditional reverse transition distribution of diffusion states.
arXiv Detail & Related papers (2025-07-03T07:42:02Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA- in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning [31.531278643184656]
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). We propose d1, a framework to adapt pre-trained dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM.
arXiv Detail & Related papers (2025-04-16T16:08:45Z) - Large Language Diffusion Models [77.02553707673418]
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We introduce LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning paradigm. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines.
arXiv Detail & Related papers (2025-02-14T08:23:51Z) - Theoretical Benefit and Limitation of Diffusion Language Model [47.579673047639126]
Diffusion language models have emerged as a promising approach for text generation. We present a rigorous theoretical analysis of a widely used type of diffusion language model, the Masked Diffusion Model (MDM). Our analysis establishes the first theoretical foundation for understanding the benefits and limitations of MDMs.
arXiv Detail & Related papers (2025-02-13T18:59:47Z) - Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning [43.74071631716718]
We show that DICE-based methods can be viewed as a transformation from the behavior distribution to the optimal policy distribution.
We propose a novel approach, Diffusion-DICE, that directly performs this transformation using diffusion models.
arXiv Detail & Related papers (2024-07-29T15:36:42Z) - Understanding Reinforcement Learning-Based Fine-Tuning of Diffusion Models: A Tutorial and Review [63.31328039424469]
This tutorial provides a comprehensive survey of methods for fine-tuning diffusion models to optimize downstream reward functions.
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning.
arXiv Detail & Related papers (2024-07-18T17:35:32Z) - Learning to Reach Goals via Diffusion [16.344212996721346]
We present a novel perspective on goal-conditioned reinforcement learning by framing it within the context of denoising diffusion models.
We then learn a goal-conditioned policy to reverse these deviations, analogous to the score function.
This approach, which we call Merlin, can reach specified goals from arbitrary initial states without learning a separate value function.
arXiv Detail & Related papers (2023-10-04T00:47:02Z) - Efficient Diffusion Policies for Offline Reinforcement Learning [85.73757789282212]
Diffusion-QL significantly boosts the performance of offline RL by representing a policy with a diffusion model.
We propose efficient diffusion policy (EDP) to overcome these two challenges.
EDP constructs actions from corrupted ones at training to avoid running the sampling chain.
arXiv Detail & Related papers (2023-05-31T17:55:21Z) - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
arXiv Detail & Related papers (2022-08-12T09:54:11Z)