Wasserstein Actor-Critic: Directed Exploration via Optimism for
Continuous-Actions Control
- URL: http://arxiv.org/abs/2303.02378v1
- Date: Sat, 4 Mar 2023 10:52:20 GMT
- Title: Wasserstein Actor-Critic: Directed Exploration via Optimism for
Continuous-Actions Control
- Authors: Amarildo Likmeta, Matteo Sacco, Alberto Maria Metelli and Marcello
Restelli
- Abstract summary: Wasserstein Actor-Critic (WAC) is an actor-critic architecture inspired by Wasserstein Q-Learning (WQL).
WAC enforces exploration in a principled way by guiding the policy learning process with the optimization of an upper bound of the Q-value estimates.
- Score: 41.7453231409493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Uncertainty quantification has been extensively used as a means to achieve
efficient directed exploration in Reinforcement Learning (RL). However,
state-of-the-art methods for continuous actions still suffer from high sample
complexity requirements. Indeed, they either completely lack strategies for
propagating the epistemic uncertainty throughout the updates, or they mix it
with aleatoric uncertainty while learning the full return distribution (e.g.,
distributional RL). In this paper, we propose Wasserstein Actor-Critic (WAC),
an actor-critic architecture, inspired by the recent Wasserstein Q-Learning
(WQL), that employs approximate Q-posteriors to represent the
epistemic uncertainty and Wasserstein barycenters for uncertainty propagation
across the state-action space. WAC enforces exploration in a principled way by
guiding the policy learning process with the optimization of an upper bound of
the Q-value estimates. Furthermore, we study some peculiar issues that arise
when using function approximation, coupled with the uncertainty estimation, and
propose a regularized loss for the uncertainty estimation. Finally, we evaluate
our algorithm on standard MuJoCo tasks as well as on a suite of continuous-actions
domains, where exploration is crucial, in comparison with state-of-the-art
baselines.
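To make the optimistic policy guidance concrete, below is a minimal, hypothetical sketch in the spirit of the abstract, not the authors' implementation. It assumes the Q-posterior at each state-action pair is summarized by a mean head and a standard-deviation head, and the actor is pushed toward actions that maximize the upper bound mean + beta * std; the networks, the action squashing, and the value of beta are illustrative assumptions, and the Wasserstein-barycenter propagation of uncertainty is not shown.

```python
# Hypothetical sketch of an optimistic actor update in the spirit of WAC:
# a Gaussian approximation of the Q-posterior (mean and log-std heads) drives
# the actor to ascend the upper bound Q_mean(s, a) + beta * Q_std(s, a).
# NOT the authors' code; architectures and beta are illustrative assumptions.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

state_dim, action_dim, beta = 3, 1, 1.0         # beta: optimism level (assumed)
actor = MLP(state_dim, action_dim)              # deterministic actor a = pi(s)
q_mean_net = MLP(state_dim + action_dim, 1)     # posterior mean of Q
q_log_std_net = MLP(state_dim + action_dim, 1)  # posterior log-std of Q (epistemic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def optimistic_actor_loss(states):
    """Push the actor toward actions with a high upper-bound Q estimate."""
    actions = torch.tanh(actor(states))         # squash to a bounded action space
    sa = torch.cat([states, actions], dim=-1)
    upper_bound = q_mean_net(sa) + beta * q_log_std_net(sa).exp()
    return -upper_bound.mean()                  # gradient ascent on the upper bound

states = torch.randn(32, state_dim)             # dummy batch of states
loss = optimistic_actor_loss(states)
actor_opt.zero_grad()
loss.backward()
actor_opt.step()
```

Setting beta to zero recovers a standard greedy actor update; larger values trade immediate exploitation for directed exploration.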
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
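As a minimal illustration of the setup in the entry above (a generic sketch, not the paper's algorithm or its exploration-tuning rule): a linear-Gaussian policy explores with a tunable standard deviation sigma during learning, while only its deterministic mean is deployed. All names, dimensions, and values below are illustrative assumptions.

```python
# Hypothetical learn-stochastic / deploy-deterministic sketch: the exploration
# level `sigma` is the quantity whose tuning the cited paper studies.
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2
theta = np.zeros((action_dim, state_dim))   # policy parameters (linear policy)
sigma = 0.2                                 # exploration level used during learning

def act_for_learning(state):
    """Stochastic action used to collect data and estimate policy gradients."""
    mean = theta @ state
    return mean + sigma * rng.standard_normal(action_dim)

def act_for_deployment(state):
    """Deterministic version of the same policy, used at deployment time."""
    return theta @ state

state = rng.standard_normal(state_dim)
print(act_for_learning(state), act_for_deployment(state))
```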
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), which can be applied to either risk-seeking or risk-averse policy optimization (a sketch of the uncertainty Bellman recursion follows this entry).
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
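The uncertainty Bellman equation (UBE) mentioned in the entry above can be illustrated on a small tabular problem. The sketch below is a generic fixed-point iteration of a UBE-style recursion U = u_local + gamma^2 * P_pi U; the MDP, the local-uncertainty term u_local, and the policy are made-up placeholders, and the cited papers derive sharper versions of this recursion.

```python
# Hypothetical tabular sketch of an uncertainty Bellman equation: the epistemic
# uncertainty U over Q-values satisfies a Bellman-style recursion and is solved
# here by fixed-point iteration. All quantities are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a]
u_local = rng.uniform(0.0, 1.0, size=(n_states, n_actions))       # per-step local uncertainty

U = np.zeros((n_states, n_actions))
for _ in range(1000):                                   # fixed-point iteration of the UBE
    # E over next states s' of E over next actions a' ~ pi of U(s', a')
    next_state_unc = (P * (pi * U).sum(axis=1)).sum(axis=2)
    U_new = u_local + gamma**2 * next_state_unc
    if np.max(np.abs(U_new - U)) < 1e-8:
        break
    U = U_new

print(np.sqrt(U))   # epistemic standard deviation of Q, usable as an exploration or risk signal
```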
- Model-Based Uncertainty in Value Functions [89.31922008981735]
We focus on characterizing the variance over values induced by a distribution over MDPs.
Previous work upper bounds the posterior variance over values by solving a so-called uncertainty Bellman equation.
We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values.
arXiv Detail & Related papers (2023-02-24T09:18:27Z)
- Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity [39.886149789339335]
Offline reinforcement learning aims to learn to perform decision making from historical data without active exploration.
Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset.
We consider a distributionally robust formulation of offline RL, focusing on robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence, in both finite-horizon and infinite-horizon settings (a sketch of a KL-robust backup follows this entry).
arXiv Detail & Related papers (2022-08-11T11:55:31Z)
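A KL-ball robust backup of the kind used in the entry above can be computed through the standard dual form inf_{KL(P||P0) <= delta} E_P[V] = max_{beta > 0} { -beta * log E_{P0}[exp(-V / beta)] - beta * delta }. The sketch below evaluates this dual by a simple grid search over beta; the nominal distribution, value vector, and radius delta are illustrative, and the cited paper builds a full offline RL algorithm around backups of this kind rather than this toy routine.

```python
# Hypothetical sketch of a KL-constrained distributionally robust expectation,
# computed via its well-known one-dimensional dual. Not the paper's algorithm.
import numpy as np

def kl_robust_expectation(p0, v, delta):
    """Worst-case expectation of v over distributions within a KL ball around p0."""
    best = -np.inf
    m = v.min()                                   # shift for numerical stability
    for beta in np.logspace(-3, 3, 400):          # crude grid search over the dual variable
        inner = np.sum(p0 * np.exp(-(v - m) / beta))
        candidate = m - beta * np.log(inner) - beta * delta
        best = max(best, candidate)
    return best

p0 = np.array([0.5, 0.3, 0.2])                    # nominal next-state distribution (assumed)
v = np.array([1.0, 0.0, 2.0])                     # next-state values (assumed)
print(np.dot(p0, v))                              # nominal backup
print(kl_robust_expectation(p0, v, delta=0.1))    # robust (pessimistic) backup
```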
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Temporal Difference Uncertainties as a Signal for Exploration [76.6341354269013]
An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy.
In this paper, we highlight that value estimates are easily biased and temporally inconsistent.
We propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors (a minimal ensemble-based sketch follows this entry).
arXiv Detail & Related papers (2020-10-05T18:11:22Z)
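As a rough illustration of the idea in the entry above, the sketch below approximates a distribution over temporal-difference errors with an ensemble of Q-tables and uses its spread as an exploration signal; this is a generic ensemble proxy, not the specific estimator proposed in the paper, and all sizes and coefficients are assumptions.

```python
# Hypothetical sketch: disagreement over temporal-difference errors across an
# ensemble of Q-tables serves as a cheap epistemic-uncertainty proxy for exploration.
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, n_ensemble, gamma = 6, 3, 8, 0.99
Q = rng.normal(scale=0.1, size=(n_ensemble, n_states, n_actions))  # randomly initialized ensemble

def td_uncertainty(s, a, r, s_next):
    """Standard deviation of the TD error across ensemble members."""
    td_errors = r + gamma * Q[:, s_next].max(axis=1) - Q[:, s, a]
    return td_errors.std()

# Example: the signal can be added to the reward as an exploration bonus.
s, a, r, s_next = 0, 1, 0.0, 3
bonus_weight = 0.5                              # illustrative coefficient
shaped_reward = r + bonus_weight * td_uncertainty(s, a, r, s_next)
print(shaped_reward)
```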
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)