The Nature of Temporal Difference Errors in Multi-step Distributional
Reinforcement Learning
- URL: http://arxiv.org/abs/2207.07570v1
- Date: Fri, 15 Jul 2022 16:19:23 GMT
- Title: The Nature of Temporal Difference Errors in Multi-step Distributional
Reinforcement Learning
- Authors: Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will
Dabney, Marc G. Bellemare
- Abstract summary: We study the multi-step off-policy learning approach to distributional RL.
We identify a novel notion of path-dependent distributional TD error.
We derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent QR-DQN-Retrace.
- Score: 46.85801978792022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the multi-step off-policy learning approach to distributional RL.
Despite the apparent similarity between value-based RL and distributional RL,
our study reveals intriguing and fundamental differences between the two cases
in the multi-step setting. We identify a novel notion of path-dependent
distributional TD error, which is indispensable for principled multi-step
distributional RL. The distinction from the value-based case has important
implications for concepts such as backward-view algorithms. Our work provides
the first theoretical guarantees on multi-step off-policy distributional RL
algorithms, including results that apply to the small number of existing
approaches to multi-step distributional RL. In addition, we derive a novel
algorithm, Quantile Regression-Retrace, which leads to a deep RL agent
QR-DQN-Retrace that shows empirical improvements over QR-DQN on the Atari-57
benchmark. Collectively, we shed light on how unique challenges in multi-step
distributional RL can be addressed both in theory and practice.
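To make the ingredients concrete, here is a minimal NumPy sketch of the two building blocks that QR-DQN-Retrace combines: Retrace-style truncated importance weights and the quantile-regression Huber loss of QR-DQN. The function names, toy shapes, and the one-step target construction below are illustrative assumptions; the paper's principled multi-step backup relies on path-dependent distributional TD errors that this sketch does not implement.

```python
# Minimal sketch (not the paper's exact algorithm): Retrace truncation weights
# plus the QR-DQN quantile Huber loss. All shapes and sampled values are toys.
import numpy as np

def retrace_coefficients(pi_probs, mu_probs, lam=1.0):
    """Truncated importance weights c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t))."""
    return lam * np.minimum(1.0, pi_probs / mu_probs)

def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """QR-DQN loss between N predicted quantiles and a set of target samples."""
    n = pred_quantiles.shape[0]
    taus = (np.arange(n) + 0.5) / n                         # quantile midpoints tau_i
    u = target_samples[None, :] - pred_quantiles[:, None]   # pairwise TD errors (N, N')
    huber = np.where(np.abs(u) <= kappa,
                     0.5 * u ** 2,
                     kappa * (np.abs(u) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    return float(np.mean(np.sum(weight * huber / kappa, axis=0)))

# Toy usage: a length-3 behaviour trajectory and 8 quantiles per state-action value.
rng = np.random.default_rng(0)
n_quantiles, gamma = 8, 0.99
rewards = rng.normal(size=3)
pi_probs = rng.uniform(0.2, 1.0, size=3)   # target-policy probabilities (made up)
mu_probs = rng.uniform(0.2, 1.0, size=3)   # behaviour-policy probabilities (made up)
c = retrace_coefficients(pi_probs, mu_probs, lam=0.9)

# One-step distributional target for step 0; a principled multi-step target would
# mix such per-step targets via the path-dependent corrections studied in the paper.
next_quantiles = rng.normal(size=n_quantiles)      # bootstrap quantiles (made up)
target = rewards[0] + gamma * next_quantiles
current = rng.normal(size=n_quantiles)             # current quantile estimates (made up)
print("Retrace coefficients:", np.round(c, 3))
print("Quantile Huber loss:", round(quantile_huber_loss(current, target), 4))
```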
Related papers
- More Benefits of Being Distributional: Second-Order Bounds for
Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z) - Bridging Distributionally Robust Learning and Offline RL: An Approach to
Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies from historical (offline) data.
One of the main challenges in offline RL is the distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z) - Keep Various Trajectories: Promoting Exploration of Ensemble Policies in
Continuous Control [17.64972760231609]
This study introduces a new ensemble RL algorithm, termed TEEN.
TEEN enhances the sample diversity of the ensemble policy compared to using sub-policies alone.
On average, TEEN outperforms the baseline ensemble DRL algorithms by 41% on the tested representative environments.
arXiv Detail & Related papers (2023-10-17T10:40:05Z) - One-Step Distributional Reinforcement Learning [10.64435582017292]
We present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
arXiv Detail & Related papers (2023-04-27T06:57:00Z) - Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z) - Offline Policy Optimization in RL with Variance Regularization [142.87345258222942]
We propose variance regularization for offline RL algorithms, using stationary distribution corrections.
We show that by using Fenchel duality, we can avoid double sampling issues for computing the gradient of the variance regularizer.
The proposed algorithm for offline variance regularization (OVAR) can be used to augment any existing offline policy optimization algorithms.
arXiv Detail & Related papers (2022-12-29T18:25:01Z) - Branching Reinforcement Learning [16.437993672422955]
We propose a novel Branching Reinforcement Learning (Branching RL) model.
We investigate Regret Minimization (RM) and Reward-Free Exploration (RFE) metrics for this model.
This model finds important applications in hierarchical recommendation systems and online advertising.
arXiv Detail & Related papers (2022-02-16T11:19:03Z) - Distributional Reinforcement Learning for Multi-Dimensional Reward
Functions [91.88969237680669]
We introduce Multi-Dimensional Distributional DQN (MD3QN) to model the joint return distribution from multiple reward sources.
As a by-product of joint distribution modeling, MD3QN can capture the randomness in returns for each source of reward.
In experiments, our method accurately models the joint return distribution in environments with richly correlated reward functions.
arXiv Detail & Related papers (2021-10-26T11:24:23Z) - Forward and inverse reinforcement learning sharing network weights and
hyperparameters [3.705785916791345]
ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process.
A forward RL step minimizes the reverse KL estimated by the inverse RL step.
We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy.
arXiv Detail & Related papers (2020-08-17T13:12:44Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep
Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration (a minimal sketch of both ingredients follows this list).
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
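For concreteness, the following NumPy sketch illustrates the two SUNRISE ingredients summarized above: a Bellman target whose loss weight shrinks as the Q-ensemble disagrees, and UCB-style action selection over ensemble statistics. The sigmoid-based weighting, constants, and shapes are illustrative assumptions rather than the paper's exact formulas.

```python
# Minimal sketch of uncertainty-weighted Bellman targets and UCB action selection.
# The weighting function and all constants/shapes here are illustrative stand-ins.
import numpy as np

def weighted_bellman_targets(ensemble_next_q, rewards, gamma=0.99, temperature=10.0):
    """Per-transition Bellman targets plus uncertainty-based loss weights.

    ensemble_next_q: (K, B) bootstrapped next-state values from K ensemble members.
    rewards:         (B,)   immediate rewards.
    Returns (targets, weights), both of shape (B,).
    """
    mean_q = ensemble_next_q.mean(axis=0)
    std_q = ensemble_next_q.std(axis=0)
    targets = rewards + gamma * mean_q
    # Down-weight transitions where the ensemble disagrees (high std);
    # weights lie in (0.5, 1.0], mirroring the "weighted Bellman backup" idea.
    weights = 1.0 / (1.0 + np.exp(std_q * temperature)) + 0.5
    return targets, weights

def ucb_action(ensemble_q_values, bonus=1.0):
    """Pick the action with the highest mean + bonus * std across the ensemble.

    ensemble_q_values: (K, A) Q-values for one state, K members, A actions.
    """
    mean_q = ensemble_q_values.mean(axis=0)
    std_q = ensemble_q_values.std(axis=0)
    return int(np.argmax(mean_q + bonus * std_q))

# Toy usage with K=5 ensemble members, a batch of 4 transitions, and 3 actions.
rng = np.random.default_rng(1)
targets, weights = weighted_bellman_targets(rng.normal(size=(5, 4)), rng.normal(size=4))
print("targets:", np.round(targets, 3), "weights:", np.round(weights, 3))
print("UCB action:", ucb_action(rng.normal(size=(5, 3)), bonus=1.0))
```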