D2 Actor Critic: Diffusion Actor Meets Distributional Critic
- URL: http://arxiv.org/abs/2510.03508v2
- Date: Tue, 14 Oct 2025 18:08:10 GMT
- Title: D2 Actor Critic: Diffusion Actor Meets Distributional Critic
- Authors: Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C Stadie
- Abstract summary: We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains.
- Score: 4.669386607943427
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.
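The abstract names the critic's two ingredients but not how they are combined. As a hedged illustration only, the sketch below applies the clipped double-Q idea to a categorical (C51-style) critic by keeping, per sample, the target return distribution with the smaller mean; the atom count, support range, and selection rule are our assumptions, not details from the paper.

```python
# Hedged sketch: a categorical (C51-style) critic combined with clipped
# double Q-learning. Atom count, support range, and the "keep the target
# distribution with the smaller mean" rule are assumptions for illustration,
# not implementation details taken from the D2AC paper.
import torch

N_ATOMS, V_MIN, V_MAX = 51, -100.0, 100.0
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed return atoms

def clipped_double_target(logits1, logits2):
    """Per sample, keep the return distribution whose mean is smaller.

    logits1, logits2: (batch, N_ATOMS) outputs of two target critics.
    """
    p1, p2 = logits1.softmax(-1), logits2.softmax(-1)
    m1 = (p1 * support).sum(-1)  # E[Z_1(s', a')]
    m2 = (p2 * support).sum(-1)  # E[Z_2(s', a')]
    return torch.where((m1 <= m2).unsqueeze(-1), p1, p2)

def project(rewards, dones, gamma, next_probs):
    """Standard C51 projection of r + gamma * Z(s', a') onto the fixed atoms."""
    tz = (rewards.unsqueeze(-1)
          + gamma * (1.0 - dones).unsqueeze(-1) * support).clamp(V_MIN, V_MAX)
    b = (tz - V_MIN) / ((V_MAX - V_MIN) / (N_ATOMS - 1))
    lo, hi = b.floor().long(), b.ceil().long()
    # keep lo != hi so no probability mass is dropped when b hits an atom
    lo = torch.where((hi > 0) & (lo == hi), lo - 1, lo)
    hi = torch.where((lo < N_ATOMS - 1) & (lo == hi), hi + 1, hi)
    proj = torch.zeros_like(next_probs)
    proj.scatter_add_(-1, lo, next_probs * (hi.float() - b))
    proj.scatter_add_(-1, hi, next_probs * (b - lo.float()))
    return proj  # cross-entropy against this is the critic's TD loss
```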
Related papers
- Causal Flow Q-Learning for Robust Offline Reinforcement Learning [53.63254824501714]
We introduce a practical implementation that learns expressive flow-matching policies from confounded demonstrations. Our proposed confounding-robust augmentation procedure achieves 120% of the success rate of confounding-unaware, state-of-the-art offline RL methods.
arXiv Detail & Related papers (2026-02-02T21:50:52Z) - Q-learning with Adjoint Matching [58.78551025170267]
We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm. QAM sidesteps two challenges by leveraging adjoint matching, a recently proposed technique in generative modeling. It consistently outperforms prior approaches on hard, sparse-reward tasks in both offline and offline-to-online RL.
arXiv Detail & Related papers (2026-01-20T18:45:34Z) - Sample-Efficient Neurosymbolic Deep Reinforcement Learning [49.60927398960061]
We propose a neuro-symbolic deep RL approach that integrates background symbolic knowledge to improve sample efficiency. Online reasoning is performed to guide the training process through two mechanisms. We show improved performance over a state-of-the-art reward-machine baseline.
arXiv Detail & Related papers (2026-01-06T09:28:53Z) - Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.60378227969643]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
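The summary specifies only the interaction protocol. Below is a minimal, hypothetical rendering of one actor-critic-refine round; `actor`, `critic`, and `score` are placeholder callables, and the prompt format and reward shaping are our inventions rather than the paper's.

```python
# Hypothetical rendering of the two-player protocol described above.
# `actor`, `critic`, and `score` are placeholder callables; Critique-RL's
# actual prompts, reward shaping, and RL update are not in this summary.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    response: str   # actor's first attempt
    critique: str   # critic's natural-language feedback
    refined: str    # actor's revision after seeing the critique

def critique_round(actor: Callable[[str], str],
                   critic: Callable[[str, str], str],
                   task: str) -> Turn:
    """One actor, then critic, then actor exchange."""
    response = actor(task)
    critique = critic(task, response)
    refined = actor(f"{task}\nFeedback: {critique}\nRevise your answer.")
    return Turn(response, critique, refined)

def critic_reward(score: Callable[[str], float], turn: Turn) -> float:
    """Reward the critic by how much its feedback improved the actor."""
    return score(turn.refined) - score(turn.response)
```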
arXiv Detail & Related papers (2025-10-28T11:37:01Z) - Efficient Online Reinforcement Learning for Diffusion Policy [38.39095131927252]
We generalize conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost. We introduce two practical algorithms, Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC).
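The specific reweighting that defines RSM is not given in this summary. For orientation only, here is generic denoising score matching with a pluggable per-timestep weight; `weight(t) = 1` recovers the conventional objective, and the model interface is an assumption.

```python
# Generic weighted denoising score matching, for orientation only: the
# specific reweighting defining RSM/DPMD/SDAC is not given in this summary.
# With weight(t) == 1 this reduces to the conventional DSM objective.
import torch

def weighted_dsm_loss(eps_model, actions, states, alphas_cumprod, weight):
    """eps_model(noisy_action, state, t) predicts the injected noise.

    alphas_cumprod: (T,) cumulative noise schedule.
    weight: callable mapping timesteps (batch,) to per-sample weights.
    """
    t = torch.randint(0, len(alphas_cumprod), (actions.shape[0],))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                   # (batch, 1)
    noise = torch.randn_like(actions)
    noisy = a_bar.sqrt() * actions + (1.0 - a_bar).sqrt() * noise
    err = (eps_model(noisy, states, t) - noise).pow(2).sum(-1)
    return (weight(t) * err).mean()
```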
arXiv Detail & Related papers (2025-02-01T07:55:06Z) - Exploiting Estimation Bias in Clipped Double Q-Learning for Continuous Control Reinforcement Learning Tasks [5.968716050740402]
This paper focuses on addressing and exploiting estimation biases in Actor-Critic methods for continuous control tasks.
We design a Bias Exploiting (BE) mechanism to dynamically select the most advantageous estimation bias during training of the RL agent.
Most state-of-the-art deep RL algorithms can be equipped with the BE mechanism without degrading performance or adding computational overhead.
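For context, clipped double Q-learning takes the minimum of two critics, which biases value estimates downward; taking the maximum instead biases them upward. A schematic of a switchable-bias TD target follows; the rule for choosing which bias is advantageous is the paper's contribution and is reduced to a boolean flag here.

```python
# Schematic: clipped double Q-learning with a switchable estimation bias.
# A fixed `min` (as in TD3) underestimates; `max` overestimates. The rule
# that decides which bias is advantageous at a given stage of training is
# the paper's contribution and is only stubbed out here.
import torch

def td_target(q1_next, q2_next, rewards, dones, gamma, pessimistic):
    """q1_next, q2_next: target-critic values at (s', pi(s'))."""
    if pessimistic:
        q_next = torch.minimum(q1_next, q2_next)  # clipped double-Q
    else:
        q_next = torch.maximum(q1_next, q2_next)  # exploit optimistic bias
    return rewards + gamma * (1.0 - dones) * q_next
```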
arXiv Detail & Related papers (2024-02-14T10:44:03Z) - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning [70.20191211010847]
Offline reinforcement learning (RL) aims to learn an optimal policy using a previously collected static dataset.
We introduce Diffusion Q-learning (Diffusion-QL) that utilizes a conditional diffusion model to represent the policy.
We show that our method can achieve state-of-the-art performance on the majority of the D4RL benchmark tasks.
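As a rough sketch of the Diffusion-QL recipe, the policy loss couples a diffusion behavior-cloning term with a Q-maximization term over actions sampled from the policy. The interfaces (`denoising_loss`, `sample`) and the scale-normalized trade-off coefficient are illustrative assumptions, not verbatim from the paper.

```python
# Rough sketch of the Diffusion-QL policy objective: a diffusion
# behavior-cloning term plus a term pushing sampled actions toward high Q.
# `denoising_loss`, `sample`, and the normalized coefficient are assumed
# interfaces/choices for illustration.
import torch

def diffusion_ql_policy_loss(policy, critic, states, dataset_actions, eta=1.0):
    bc_loss = policy.denoising_loss(dataset_actions, states)  # stay near data
    actions = policy.sample(states)        # actions via reverse diffusion
    q = critic(states, actions)
    alpha = eta / q.abs().mean().detach()  # keep the trade-off scale-invariant
    return bc_loss - alpha * q.mean()      # minimize BC loss, maximize value
```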
arXiv Detail & Related papers (2022-08-12T09:54:11Z) - Adversarially Trained Actor Critic for Offline Reinforcement Learning [42.42451519801851]
ATAC is a new model-free algorithm for offline reinforcement learning under insufficient data coverage.
In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
arXiv Detail & Related papers (2022-02-05T01:02:46Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
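A hedged sketch of the down-weighting idea: estimate the target value's epistemic uncertainty and shrink the Bellman error of high-uncertainty (likely OOD) pairs. UWAC itself obtains uncertainty from MC-dropout; the ensemble-based variance and clipping constants below are stand-ins.

```python
# Hedged sketch of uncertainty-weighted backups. UWAC estimates target
# uncertainty with MC-dropout; here an ensemble stands in, and `beta`
# and the clipping constants are placeholders.
import torch

def uwac_critic_loss(q_pred, target_samples, beta=2.0):
    """target_samples: (n_estimates, batch) stochastic target-Q estimates."""
    target = target_samples.mean(0)
    var = target_samples.var(0)
    # down-weight high-variance (likely OOD) pairs; cap weights at 1
    w = (beta / var.clamp(min=1e-4)).clamp(max=1.0).detach()
    return (w * (q_pred - target).pow(2)).mean()
```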
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z) - Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z) - Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z) - Robust Reinforcement Learning via Adversarial training with Langevin Dynamics [51.234482917047835]
We introduce a sampling perspective to tackle the challenging task of training robust Reinforcement Learning (RL) agents.
We present a novel, scalable two-player RL algorithm, which is a sampling variant of the two-player policy gradient method; a schematic Langevin update is sketched after this entry.
arXiv Detail & Related papers (2020-02-14T14:59:14Z)
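A schematic of the sampling-based update suggested above: both players take stochastic-gradient Langevin steps on a shared objective, with the adversary descending what the protagonist ascends. The step size and noise scale are placeholders, not values from the paper.

```python
# Schematic stochastic-gradient Langevin step for the two-player game:
# the protagonist ascends the shared objective, the adversary descends it,
# and both inject Gaussian noise so the dynamics sample around the
# saddle point instead of collapsing to a single best response.
import torch

def langevin_step(params, grad, lr, ascend, noise_scale=0.01):
    direction = grad if ascend else -grad
    noise = torch.randn_like(params) * (2.0 * lr) ** 0.5 * noise_scale
    return params + lr * direction + noise

# One round on a shared objective J(theta, phi):
#   theta = langevin_step(theta, dJ_dtheta, lr, ascend=True)   # protagonist
#   phi   = langevin_step(phi,   dJ_dphi,   lr, ascend=False)  # adversary
```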
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.