One-Step Distributional Reinforcement Learning
- URL: http://arxiv.org/abs/2304.14421v1
- Date: Thu, 27 Apr 2023 06:57:00 GMT
- Title: One-Step Distributional Reinforcement Learning
- Authors: Mastane Achab, Reda Alami, Yasser Abdelaziz Dahou Djilali, Kirill
Fedyanin, Eric Moulines
- Abstract summary: We present the simpler one-step distributional reinforcement learning (OS-DistrRL) framework.
We show that our approach comes with a unified theory for both policy evaluation and control.
We propose two OS-DistrRL algorithms for which we provide an almost sure convergence analysis.
- Score: 10.64435582017292
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) allows an agent interacting sequentially with an
environment to maximize its long-term expected return. In the distributional RL
(DistrRL) paradigm, the agent goes beyond the limit of the expected value, to
capture the underlying probability distribution of the return across all time
steps. The set of DistrRL algorithms has led to improved empirical performance.
Nevertheless, the theory of DistrRL is still not fully understood, especially
in the control case. In this paper, we present the simpler one-step
distributional reinforcement learning (OS-DistrRL) framework encompassing only
the randomness induced by the one-step dynamics of the environment. Contrary to
DistrRL, we show that our approach comes with a unified theory for both policy
evaluation and control. Indeed, we propose two OS-DistrRL algorithms for which
we provide an almost sure convergence analysis. The proposed approach compares
favorably with categorical DistrRL on various environments.
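The abstract stays at a conceptual level, so the following sketch illustrates the general shape of a one-step categorical distributional update: a per-state return-distribution estimate is moved toward a target built from a single observed transition. It is a minimal illustration in the style of categorical DistrRL, assuming a fixed atom support, a standard categorical projection, and a constant step size; it is not the authors' OS-DistrRL algorithms, and every name and constant in it is an assumption made for the example.

```python
# Minimal sketch of a categorical one-step distributional TD update for
# policy evaluation on a tabular MDP. Illustrative only: NOT the paper's
# exact OS-DistrRL procedure; atom grid, step size and names are assumptions.
import numpy as np

N_ATOMS = 51                    # number of support atoms
V_MIN, V_MAX = -10.0, 10.0      # assumed range of returns
ATOMS = np.linspace(V_MIN, V_MAX, N_ATOMS)
DELTA = ATOMS[1] - ATOMS[0]

def project_onto_atoms(shifted_atoms, probs):
    """Project a distribution supported on r + gamma * ATOMS back onto the
    fixed grid ATOMS (the standard categorical projection)."""
    projected = np.zeros(N_ATOMS)
    clipped = np.clip(shifted_atoms, V_MIN, V_MAX)
    b = (clipped - V_MIN) / DELTA              # fractional index of each atom
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)
    for j in range(N_ATOMS):
        if lower[j] == upper[j]:               # landed exactly on an atom
            projected[lower[j]] += probs[j]
        else:                                  # split mass between neighbours
            projected[lower[j]] += probs[j] * (upper[j] - b[j])
            projected[upper[j]] += probs[j] * (b[j] - lower[j])
    return projected

def one_step_td_update(dist, s, r, s_next, gamma=0.99, lr=0.1):
    """Move the return-distribution estimate at state s toward the one-step
    target built from a single observed transition (s, r, s_next)."""
    target = project_onto_atoms(r + gamma * ATOMS, dist[s_next])
    dist[s] = (1.0 - lr) * dist[s] + lr * target   # convex mixture update
    return dist

# Usage: 5 states, start from a uniform estimate per state.
dist = np.full((5, N_ATOMS), 1.0 / N_ATOMS)
dist = one_step_td_update(dist, s=0, r=1.0, s_next=2)
```

Under these assumptions the estimate at each state stays a valid probability vector, because the categorical projection preserves total mass and the update is a convex mixture of two distributions.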
Related papers
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$; a hypothetical sketch of this subset-maximization step appears after this list.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning [58.626683114119906]
We show that Distributional Reinforcement Learning (DistRL) can obtain second-order bounds in both online and offline RL.
Our results are the first second-order bounds for low-rank MDPs and for offline RL.
arXiv Detail & Related papers (2024-02-11T13:25:53Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework that acquires exploratory trajectories enabling accurate learning of the hidden reward function.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z)
- The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning [46.85801978792022]
We study the multi-step off-policy learning approach to distributional RL.
We identify a novel notion of path-dependent distributional TD error.
We derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent QR-DQN-Retrace.
arXiv Detail & Related papers (2022-07-15T16:19:23Z)
- Exploration with Multi-Sample Target Values for Distributional Reinforcement Learning [20.680417111485305]
We introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation.
The improved distributional estimates lend themselves to UCB-based exploration.
We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control.
arXiv Detail & Related papers (2022-02-06T03:27:05Z)
- Robustness and risk management via distributional dynamic programming [13.173307471333619]
We introduce a new class of distributional operators, together with a practical DP algorithm for policy evaluation.
Our approach reformulates the problem through an augmented state space where each state is split into a worst-case substate and a best-case substate.
We derive distributional operators and DP algorithms solving a new control task.
arXiv Detail & Related papers (2021-12-28T12:12:57Z)
- Forward and inverse reinforcement learning sharing network weights and hyperparameters [3.705785916791345]
ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process.
A forward RL step minimizes the reverse KL estimated by the inverse RL step.
We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy.
arXiv Detail & Related papers (2020-08-17T13:12:44Z)
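As referenced in the Stochastic Q-learning entry above, the following hypothetical sketch shows the subset-maximization step that entry describes: the bootstrap target takes a maximum over a random set of roughly $\mathcal{O}(\log(n))$ actions rather than over all $n$. It is an assumption-laden illustration, not the paper's StochDQN or StochDDQN implementation; all names and defaults are made up for the example.

```python
# Hypothetical sketch of "stochastic maximization" Q-learning: bootstrap with
# a max over a random subset of about log2(n) actions instead of all n.
# Illustrative only; not the paper's StochDQN/StochDDQN code.
import math
import random
import numpy as np

def stochastic_max(q_row, rng):
    """Max of Q-values over a random subset of roughly log2(n) actions."""
    n = len(q_row)
    k = max(1, math.ceil(math.log2(n)))
    subset = rng.sample(range(n), k)
    return max(q_row[a] for a in subset)

def stochastic_q_update(Q, s, a, r, s_next, rng, gamma=0.99, lr=0.1):
    """Tabular Q-learning step whose target uses the subset maximum."""
    target = r + gamma * stochastic_max(Q[s_next], rng)
    Q[s, a] += lr * (target - Q[s, a])
    return Q

# Usage: 10 states, 1024 actions -> each update only inspects ~10 of them.
rng = random.Random(0)
Q = np.zeros((10, 1024))
Q = stochastic_q_update(Q, s=0, a=3, r=1.0, s_next=1, rng=rng)
```

The same subset trick could be applied at action-selection time by taking the argmax over the sampled subset, which matches the entry's statement that the approach is integrated into both value-function updates and action selection.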
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.