Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
- URL: http://arxiv.org/abs/2508.13904v2
- Date: Wed, 01 Oct 2025 06:13:26 GMT
- Title: Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
- Authors: Thanh Nguyen, Chang D. Yoo,
- Abstract summary: Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning.<n>Current efforts to accelerate DQL toward one-step denoising rely on auxiliary modules or policy distillation.<n>One-Step Flow Q-Learning (OFQL) enables effective one-step action generation during both training and inference.
- Score: 34.007490094198424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning, but its reliance on multi-step denoising for action generation renders both training and inference slow and fragile. Existing efforts to accelerate DQL toward one-step denoising typically rely on auxiliary modules or policy distillation, sacrificing either simplicity or performance. It remains unclear whether a one-step policy can be trained directly without such trade-offs. To this end, we introduce One-Step Flow Q-Learning (OFQL), a novel framework that enables effective one-step action generation during both training and inference, without auxiliary modules or distillation. OFQL reformulates the DQL policy within the Flow Matching (FM) paradigm but departs from conventional FM by learning an average velocity field that directly supports accurate one-step action generation. This design removes the need for multi-step denoising and backpropagation-through-time updates, resulting in substantially faster and more robust learning. Extensive experiments on the D4RL benchmark show that OFQL, despite generating actions in a single step, not only significantly reduces computation during both training and inference but also outperforms multi-step DQL by a large margin. Furthermore, OFQL surpasses all other baselines, achieving state-of-the-art performance in D4RL.
Related papers
- One-Step Generative Policies with Q-Learning: A Reformulation of MeanFlow [56.13949180229929]
We introduce a one-step generative policy for offline reinforcement learning that maps noise directly to actions via a residual reformulation of MeanFlow.<n>Our method achieves strong performance in both offline and offline-to-online reinforcement learning settings.
arXiv Detail & Related papers (2025-11-17T06:34:17Z) - Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model [98.35868970993232]
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm.<n>We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z) - Shortcut Flow Matching for Speech Enhancement: Step-Invariant flows via single stage training [20.071957855504206]
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement.<n>We introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model.<n>Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU.
arXiv Detail & Related papers (2025-09-25T20:09:05Z) - Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling [69.84963245729826]
We propose an auxiliary task of QL to enhance the backbone for subsequent contrastive learning of the retriever.<n>We introduce our model, which incorporates two key components: Attention Block (AB) and Document Corruption (DC)
arXiv Detail & Related papers (2025-04-07T16:03:59Z) - Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization [83.65278205301576]
We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots.<n>This is achieved through an optimization consistency training protocol, which minimizes the difference among samples.<n>Experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency.
arXiv Detail & Related papers (2025-02-05T07:13:43Z) - Diffusion Policies creating a Trust Region for Offline Reinforcement Learning [66.17291150498276]
We introduce a dual policy approach, Diffusion Trusted Q-Learning (DTQL), which comprises a diffusion policy for pure behavior cloning and a practical one-step policy.
DTQL eliminates the need for iterative denoising sampling during both training and inference, making it remarkably computationally efficient.
We show that DTQL could not only outperform other methods on the majority of the D4RL benchmark tasks but also demonstrate efficiency in training and inference speeds.
arXiv Detail & Related papers (2024-05-30T05:04:33Z) - Boosting Continuous Control with Consistency Policy [14.78980095597872]
We propose a novel time-efficiency method named Consistency Policy with Q-Learning (CPQL)
By establishing a mapping from the reverse diffusion trajectories to the desired policy, we simultaneously address the issues of time efficiency and inaccurate guidance.
CPQL achieves new state-of-the-art performance on 11 offline and 21 online tasks, significantly improving inference speed by nearly 45 times compared to Diffusion-QL.
arXiv Detail & Related papers (2023-10-10T06:26:05Z) - IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive
Control [8.374040635931298]
We introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL)
More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning.
We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents
arXiv Detail & Related papers (2023-06-01T16:24:40Z) - Equalization Loss v2: A New Gradient Balance Approach for Long-tailed
Object Detection [12.408265499394089]
Recently proposed decoupled training methods emerge as a dominant paradigm for long-tailed object detection.
End-to-end training methods, like equalization loss (EQL), still perform worse than decoupled training methods.
EQL v2 is a novel gradient guided reweighing mechanism that re-balances the training process for each category independently and equally.
arXiv Detail & Related papers (2020-12-15T19:01:48Z) - Single-partition adaptive Q-learning [0.0]
Single- Partition adaptive Q-learning (SPAQL) is an algorithm for model-free episodic reinforcement learning.
Tests on episodes with a large number of time steps show that SPAQL has no problems scaling, unlike adaptive Q-learning (AQL)
We claim that SPAQL may have a higher sample efficiency than AQL, thus being a relevant contribution to the field of efficient model-free RL methods.
arXiv Detail & Related papers (2020-07-14T00:03:25Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.