Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters
- URL: http://arxiv.org/abs/2205.13703v1
- Date: Fri, 27 May 2022 01:30:12 GMT
- Title: Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters
- Authors: Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, Ofir Nachum
- Abstract summary: We take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL).
We propose MSG, a practical offline RL algorithm that trains an ensemble of $Q$-functions with independently computed targets based on completely separate networks.
Our experiments on the popular D4RL and RL Unplugged offline RL benchmarks demonstrate that MSG with deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide margin.
- Score: 35.17151863463472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the success of ensembles for uncertainty estimation in
supervised learning, we take a renewed look at how ensembles of $Q$-functions
can be leveraged as the primary source of pessimism for offline reinforcement
learning (RL). We begin by identifying a critical flaw in a popular algorithmic
choice used by many ensemble-based RL algorithms, namely the use of shared
pessimistic target values when computing each ensemble member's Bellman error.
Through theoretical analyses and construction of examples in toy MDPs, we
demonstrate that shared pessimistic targets can paradoxically lead to value
estimates that are effectively optimistic. Given this result, we propose MSG, a
practical offline RL algorithm that trains an ensemble of $Q$-functions with
independently computed targets based on completely separate networks, and
optimizes a policy with respect to the lower confidence bound of predicted
action values. Our experiments on the popular D4RL and RL Unplugged offline RL
benchmarks demonstrate that on challenging domains such as antmazes, MSG with
deep ensembles surpasses highly well-tuned state-of-the-art methods by a wide
margin. Additionally, through ablations on benchmark domains, we verify the
critical significance of using independently trained $Q$-functions, and study
the role of ensemble size. Finally, as using separate networks per ensemble
member can become computationally costly with larger neural network
architectures, we investigate whether efficient ensemble approximations
developed for supervised learning can be similarly effective, and demonstrate
that they do not match the performance and robustness of MSG with separate
networks, highlighting the need for new efforts into efficient uncertainty
estimation directed at RL.
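To make the core algorithmic choice concrete, below is a minimal sketch (not the authors' code) of the two ingredients the abstract describes: each ensemble member bootstraps from its own, independently maintained target network rather than a shared pessimistic target, and the policy is optimized against a lower confidence bound (LCB) of the ensemble's predicted action values. All names, network sizes, and hyperparameters (e.g. beta) are illustrative assumptions.

```python
# Sketch of MSG's two ingredients as described in the abstract; everything
# concrete here (architectures, beta, optimizers) is an assumption, and
# target-network updates / additional regularizers are omitted for brevity.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class MSGSketch:
    def __init__(self, obs_dim, act_dim, n_members=4, gamma=0.99, beta=-4.0):
        self.gamma, self.beta = gamma, beta  # beta < 0 yields a lower confidence bound
        # Completely separate networks per ensemble member, each with its own target copy.
        self.qs = [mlp(obs_dim + act_dim, 1) for _ in range(n_members)]
        self.q_targs = [mlp(obs_dim + act_dim, 1) for _ in range(n_members)]
        for q, qt in zip(self.qs, self.q_targs):
            qt.load_state_dict(q.state_dict())
        self.q_opts = [torch.optim.Adam(q.parameters(), lr=3e-4) for q in self.qs]
        self.policy = mlp(obs_dim, act_dim)
        self.pi_opt = torch.optim.Adam(self.policy.parameters(), lr=3e-4)

    def update_critics(self, s, a, r, s2):
        """s, a, s2: [batch, dim] tensors; r: [batch, 1] rewards from the offline dataset."""
        with torch.no_grad():
            a2 = torch.tanh(self.policy(s2))
        for q, qt, opt in zip(self.qs, self.q_targs, self.q_opts):
            with torch.no_grad():
                # Independent target: each member regresses onto ITS OWN target network,
                # never onto a value shared across the ensemble.
                y = r + self.gamma * qt(torch.cat([s2, a2], dim=-1))
            loss = ((q(torch.cat([s, a], dim=-1)) - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

    def update_policy(self, s):
        a = torch.tanh(self.policy(s))
        vals = torch.stack([q(torch.cat([s, a], dim=-1)) for q in self.qs], dim=0)
        # Policy objective: maximize the ensemble's lower confidence bound.
        lcb = vals.mean(dim=0) + self.beta * vals.std(dim=0)
        loss = -lcb.mean()
        self.pi_opt.zero_grad(); loss.backward(); self.pi_opt.step()
```

For contrast, the pattern the abstract identifies as flawed would replace the per-member value `qt(...)` with a single pessimistic quantity (e.g. a minimum or LCB computed across all target networks) that every member then regresses onto.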
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach to compute unbiased Monte Carlo-based estimates.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - UDQL: Bridging The Gap between MSE Loss and The Optimal Value Function in Offline Reinforcement Learning [10.593924216046977]
We first theoretically analyze the overestimation phenomenon caused by the MSE loss and provide a theoretical upper bound on the overestimation error.
Finally, we propose an offline RL algorithm based on the underestimated operator and a diffusion policy model.
arXiv Detail & Related papers (2024-06-05T14:37:42Z) - LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks [52.46420522934253]
We introduce LoRA-Ensemble, a parameter-efficient deep ensemble method for self-attention networks.
By employing a single pre-trained self-attention network with weights shared across all members, we train member-specific low-rank matrices for the attention projections.
Our method exhibits superior calibration compared to explicit ensembles and achieves similar or better accuracy across various prediction tasks and datasets.
arXiv Detail & Related papers (2024-05-23T11:10:32Z) - Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, StochDDQN, all of which integrate this approach for both value-function updates and action selection.
arXiv Detail & Related papers (2024-05-16T17:58:44Z) - Neural Network Approximation for Pessimistic Offline Reinforcement Learning [17.756108291816908]
We present a non-asymptotic estimation error of pessimistic offline RL using general neural network approximation.
Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight.
arXiv Detail & Related papers (2023-12-19T05:17:27Z) - SMORE: Score Models for Offline Goal-Conditioned Reinforcement Learning [33.125187822259186]
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with learning to achieve multiple goals in an environment purely from offline datasets using sparse reward functions.
We present a novel approach to GCRL under a new lens of mixture-distribution matching, leading to our discriminator-free method: SMORe.
arXiv Detail & Related papers (2023-11-03T16:19:33Z) - Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z) - Deep Negative Correlation Classification [82.45045814842595]
Existing deep ensemble methods naively train many different models and then aggregate their predictions.
We propose deep negative correlation classification (DNCC).
DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated.
arXiv Detail & Related papers (2022-12-14T07:35:20Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble [16.92791301062903]
We propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution.
Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with the clipped Q-learning.
arXiv Detail & Related papers (2021-10-04T16:40:13Z)
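The "clipped Q-learning" mentioned in the last entry builds its bootstrap target by taking a minimum over the ensemble's target networks, i.e. a single pessimistic target shared by all members, which is precisely the design choice the MSG abstract argues can become effectively optimistic. The sketch below is illustrative only (function and argument names are assumptions, not code from either paper).

```python
# Illustrative clipped (min-over-ensemble) bootstrap target; note the result is
# one SHARED pessimistic target used by every ensemble member.
import torch

def clipped_ensemble_target(q_targets, reward, next_sa, gamma=0.99):
    """q_targets: list of target Q-networks; reward: [batch, 1]; next_sa: [batch, dim] (s', a') features."""
    with torch.no_grad():
        next_qs = torch.stack([qt(next_sa) for qt in q_targets], dim=0)  # [n_members, batch, 1]
        return reward + gamma * next_qs.min(dim=0).values
```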