Normality-Guided Distributional Reinforcement Learning for Continuous
Control
- URL: http://arxiv.org/abs/2208.13125v3
- Date: Wed, 17 Jan 2024 22:55:00 GMT
- Title: Normality-Guided Distributional Reinforcement Learning for Continuous
Control
- Authors: Ju-Seung Byun, Andrew Perrault
- Abstract summary: Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms.
We study the value distribution in several continuous control tasks and find that the learned value distribution is empirically quite close to normal.
We propose a policy update strategy based on the correctness of the value distribution, as measured by structural characteristics not present in the standard value function.
- Score: 16.324313304691426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning a predictive model of the mean return, or value function, plays a
critical role in many reinforcement learning algorithms. Distributional
reinforcement learning (DRL) has been shown to improve performance by modeling
the value distribution, not just the mean. We study the value distribution in
several continuous control tasks and find that the learned value distribution
is empirically quite close to normal. We design a method that exploits this
property, employing variances predicted by a variance network, together with
returns, to analytically compute target quantile bars representing a normal
distribution for our distributional value function. In addition, we propose a
policy update strategy based on the correctness of the value distribution, as
measured by structural characteristics not present in the standard value
function. The approach
we outline is compatible with many DRL structures. We use two representative
on-policy algorithms, PPO and TRPO, as testbeds. Our method yields
statistically significant improvements in 10 out of 16 continuous task
settings, while utilizing a reduced number of weights and achieving faster
training time compared to an ensemble-based method for quantifying value
distribution uncertainty.
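The normality assumption makes the quantile targets available in closed form. Below is a minimal sketch of that idea, not the authors' implementation: treat the return as the mean and the variance network's output as the variance of a normal, then read the target quantile values off the inverse normal CDF at midpoint quantile fractions. The function name, argument names, and use of SciPy are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def normal_quantile_targets(returns, variances, n_quantiles=32):
    """Analytically place quantile targets on a normal distribution.

    Sketch only: the predicted return is treated as the mean and the
    variance-network output as the variance; targets are the quantile
    values of that normal at tau_i = (2i + 1) / (2 * n_quantiles).
    """
    returns = np.asarray(returns, dtype=np.float64)          # shape (batch,)
    stds = np.sqrt(np.asarray(variances, dtype=np.float64))  # shape (batch,)
    taus = (2.0 * np.arange(n_quantiles) + 1.0) / (2.0 * n_quantiles)
    z = norm.ppf(taus)                                        # standard-normal quantiles
    # Broadcast: one row of quantile targets per state in the batch.
    return returns[:, None] + stds[:, None] * z[None, :]

# Example: two states with predicted returns and variances.
targets = normal_quantile_targets([1.0, -0.5], [0.04, 0.25], n_quantiles=5)
```

Such analytically placed targets could then stand in for sampled single-return targets in a quantile-style distributional critic; that is one plausible way the variance network ties into the distributional value function described above.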
Related papers
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from IS, enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z) - Distributional Reinforcement Learning with Dual Expectile-Quantile Regression [51.87411935256015]
The quantile regression approach to distributional RL provides a flexible and effective way of learning arbitrary return distributions.
We show that distributional guarantees vanish, and we empirically observe that the estimated distribution rapidly collapses to its mean estimate.
Motivated by the efficiency of $L_2$-based learning, we propose to jointly learn expectiles and quantiles of the return distribution in a way that allows efficient learning while keeping an estimate of the full distribution of returns.
arXiv Detail & Related papers (2023-05-26T12:30:05Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Exploration with Multi-Sample Target Values for Distributional
Reinforcement Learning [20.680417111485305]
We introduce multi-sample target values (MTV) for distributional RL, as a principled replacement for single-sample target value estimation.
The improved distributional estimates lend themselves to UCB-based exploration.
We evaluate our approach on a range of continuous control tasks and demonstrate state-of-the-art model-free performance on difficult tasks such as Humanoid control.
arXiv Detail & Related papers (2022-02-06T03:27:05Z) - Learning Calibrated Uncertainties for Domain Shift: A Distributionally
Robust Learning Approach [150.8920602230832]
We propose a framework for learning calibrated uncertainties under domain shifts.
In particular, the density ratio estimation reflects the closeness of a target (test) sample to the source (training) distribution.
We show that our proposed method generates calibrated uncertainties that benefit downstream tasks.
arXiv Detail & Related papers (2020-10-08T02:10:54Z) - A Distributional Analysis of Sampling-Based Reinforcement Learning
Algorithms [67.67377846416106]
We present a distributional approach to the theoretical analysis of reinforcement learning algorithms with constant step-sizes.
We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules which are contractive in the space of distributions of functions.
arXiv Detail & Related papers (2020-03-27T05:13:29Z) - Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for
Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)