Balancing Value Underestimation and Overestimation with Realistic
Actor-Critic
- URL: http://arxiv.org/abs/2110.09712v2
- Date: Wed, 20 Oct 2021 00:59:22 GMT
- Title: Balancing Value Underestimation and Overestimation with Realistic
Actor-Critic
- Authors: Sicen Li, Gang Wang, Qinyun Tang, Liquan Wang
- Abstract summary: This paper introduces a novel model-free algorithm, Realistic Actor-Critic (RAC), which can be incorporated into any off-policy RL algorithm to improve sample efficiency.
RAC employs Universal Value Function Approximators (UVFA) to learn, within a single neural network, a family of policies, each with a different trade-off between underestimation and overestimation.
We evaluate RAC on the MuJoCo benchmark, achieving 10x higher sample efficiency and a 25% performance improvement over SAC on the most challenging Humanoid environment.
- Score: 6.205681604290727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-free deep reinforcement learning (RL) has been successfully applied to
challenging continuous control domains. However, poor sample efficiency
prevents these methods from being widely used in real-world domains. This paper
introduces a novel model-free algorithm, Realistic Actor-Critic (RAC), which can
be incorporated into any off-policy RL algorithm to improve sample efficiency.
RAC employs Universal Value Function Approximators (UVFA) to learn, within a
single neural network, a family of policies, each with a different trade-off
between underestimation and overestimation. To learn such policies, we introduce
uncertainty-punished Q-learning, which uses the uncertainty from an ensemble of
critics to build a spectrum of confidence bounds on the Q-function. We evaluate
RAC on the MuJoCo benchmark, achieving 10x higher sample efficiency and a 25%
performance improvement over SAC on the most challenging Humanoid environment.
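As a rough illustration of the uncertainty-punished target described in the abstract, the following is a minimal sketch under my own assumptions; it is not the authors' code, and names such as `uncertainty_punished_target`, `target_critics`, and `beta` are illustrative. An ensemble of target critics yields a mean and a standard deviation for each bootstrap value, and a trade-off parameter beta scales how strongly that uncertainty is subtracted, sweeping from optimistic to pessimistic value estimates. In RAC as described, such a parameter would presumably also condition the UVFA-based actor and critics so that one network represents the whole policy family.
```python
# Minimal sketch of an uncertainty-punished Q-target (illustrative, not the paper's code).
# target_critics: callables mapping (obs, act) -> Q-value tensor of shape (batch, 1).
import torch

def uncertainty_punished_target(target_critics, next_obs, next_act,
                                reward, done, beta, gamma=0.99):
    # Stack ensemble predictions: shape (ensemble_size, batch)
    q_values = torch.stack([q(next_obs, next_act).squeeze(-1)
                            for q in target_critics], dim=0)
    q_mean = q_values.mean(dim=0)
    q_std = q_values.std(dim=0)   # epistemic uncertainty from the ensemble
    # Punish the bootstrap value by beta * std:
    # larger beta -> lower confidence bound (underestimation),
    # beta near zero -> plain ensemble mean (risk of overestimation).
    q_bound = q_mean - beta * q_std
    return reward + gamma * (1.0 - done) * q_bound
```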
Related papers
- Multi-agent Off-policy Actor-Critic Reinforcement Learning for Partially Observable Environments [30.280532078714455]
This study proposes the use of a social learning method to estimate a global state within a multi-agent off-policy actor-critic algorithm for reinforcement learning.
We show that the difference between the final outcomes, obtained when the global state is fully observed versus estimated through the social learning method, is $\varepsilon$-bounded when an appropriate number of iterations of social learning updates is implemented.
arXiv Detail & Related papers (2024-07-06T06:51:14Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
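To make the idea in the Stochastic Q-learning entry above concrete, here is a minimal sketch under my own assumptions; the sampling scheme and the name `stoch_argmax` are illustrative, not the paper's implementation. The maximization over all $n$ actions is replaced by a maximization over a small random subset of roughly logarithmic size, which would stand in for the exact max both when forming bootstrap targets and when acting greedily.
```python
# Illustrative sketch: approximate argmax over a large discrete action space by
# scoring only a small random subset of actions (roughly O(log n) of them).
import math
import random

def stoch_argmax(q_value_fn, state, n_actions, rng=random):
    # Sample a subset of candidate actions instead of scanning all n of them.
    subset_size = max(1, math.ceil(math.log2(n_actions)))
    candidates = rng.sample(range(n_actions), subset_size)
    # Return the candidate with the highest estimated Q-value.
    return max(candidates, key=lambda a: q_value_fn(state, a))
```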
- Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters [35.17151863463472]
We take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL).
We propose MSG, a practical offline RL algorithm that trains an ensemble of $Q$-functions with independently computed targets based on completely separate networks.
Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that MSG with deep ensembles surpasses well-tuned state-of-the-art methods by a wide margin.
arXiv Detail & Related papers (2022-05-27T01:30:12Z)
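A minimal sketch of the "independently computed targets" idea from the MSG entry above, under my own assumptions (the name `independent_ensemble_targets` is illustrative): each ensemble member bootstraps only from its own target network, rather than from a single shared or aggregated target.
```python
# Illustrative sketch: per-member bootstrap targets for an ensemble of Q-functions.
# Each member uses only its own target network, keeping the ensemble members'
# targets independent (in contrast to a shared, aggregated target).
import torch

def independent_ensemble_targets(target_critics, next_obs, next_act,
                                 reward, done, gamma=0.99):
    targets = []
    for target_q in target_critics:   # completely separate networks
        q_next = target_q(next_obs, next_act).squeeze(-1)
        targets.append(reward + gamma * (1.0 - done) * q_next)
    # One target per ensemble member: shape (ensemble_size, batch)
    return torch.stack(targets, dim=0)
```
Each online critic would then regress onto its own row of the returned tensor, and the spread across members could serve as the source of pessimism for the policy objective.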
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
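A minimal sketch of the down-weighting idea from the UWAC entry above; the exponential weighting and the name `uncertainty_weighted_td_loss` are my own assumptions, not the paper's formulation. Each transition's TD error is scaled by a weight that shrinks as the estimated uncertainty of its bootstrap target grows, so suspected out-of-distribution pairs contribute less to the critic update.
```python
# Illustrative sketch: scale each transition's squared TD error by a weight that
# decays with the estimated uncertainty of its bootstrap target.
import torch

def uncertainty_weighted_td_loss(q_pred, td_target, target_uncertainty,
                                 temperature=1.0):
    # Smaller weight for higher uncertainty; detach so weights carry no gradient.
    weights = torch.exp(-target_uncertainty / temperature).detach()
    td_error = q_pred - td_target.detach()
    return (weights * td_error.pow(2)).mean()
```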
- Combining Pessimism with Optimism for Robust and Efficient Model-Based Deep Reinforcement Learning [56.17667147101263]
In real-world tasks, reinforcement learning agents encounter situations that are not present during training time.
To ensure reliable performance, the RL agents need to exhibit robustness against worst-case situations.
We propose the Robust Hallucinated Upper-Confidence RL (RH-UCRL) algorithm to provably solve this problem.
arXiv Detail & Related papers (2021-03-18T16:50:17Z)
- COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-actions.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
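As a minimal sketch of the mechanism in the Cross Learning entry above (my own assumptions; `cross_q_target` and the uniform random selection are illustrative), the bootstrap target is computed from one randomly chosen member of a set of parallel Q-networks, generalizing the two-estimator idea of double Q-learning.
```python
# Illustrative sketch: bootstrap from a randomly selected member of a set of
# parallel Q-networks over a discrete action space.
import random
import torch

def cross_q_target(q_networks, next_obs, reward, done, gamma=0.99, rng=random):
    # Pick one network at random to evaluate the next-state value.
    q_net = rng.choice(q_networks)
    with torch.no_grad():
        next_q = q_net(next_obs).max(dim=-1).values
    return reward + gamma * (1.0 - done) * next_q
```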
- A Nonparametric Off-Policy Policy Gradient [32.35604597324448]
Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes.
We build on the general sample efficiency of off-policy algorithms.
We show that our approach has better sample efficiency than state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2020-01-08T10:13:08Z)