The Importance of Pessimism in Fixed-Dataset Policy Optimization
- URL: http://arxiv.org/abs/2009.06799v3
- Date: Sun, 29 Nov 2020 05:58:30 GMT
- Title: The Importance of Pessimism in Fixed-Dataset Policy Optimization
- Authors: Jacob Buckman, Carles Gelada, Marc G. Bellemare
- Abstract summary: We study worst-case guarantees on the expected return of fixed-dataset policy optimization algorithms.
For naive approaches, the possibility of erroneous value overestimation leads to a difficult-to-satisfy requirement.
We show why pessimistic algorithms can achieve good performance even when the dataset is not informative of every policy.
- Score: 32.22700716592194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study worst-case guarantees on the expected return of fixed-dataset policy
optimization algorithms. Our core contribution is a unified conceptual and
mathematical framework for the study of algorithms in this regime. This
analysis reveals that for naive approaches, the possibility of erroneous value
overestimation leads to a difficult-to-satisfy requirement: in order to
guarantee that we select a policy which is near-optimal, we may need the
dataset to be informative of the value of every policy. To avoid this,
algorithms can follow the pessimism principle, which states that we should
choose the policy which acts optimally in the worst possible world. We show why
pessimistic algorithms can achieve good performance even when the dataset is
not informative of every policy, and derive families of algorithms which follow
this principle. These theoretical findings are validated by experiments on a
tabular gridworld, and deep learning experiments on four MinAtar environments.
Related papers
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of citet2015 yields performance guarantees that are superior in nearly all possible terms to all previous results.
arXiv Detail & Related papers (2023-09-27T16:42:10Z) - Bayesian Safe Policy Learning with Chance Constrained Optimization: Application to Military Security Assessment during the Vietnam War [0.0]
We investigate whether it would have been possible to improve a security assessment algorithm employed during the Vietnam War.
This empirical application raises several methodological challenges that frequently arise in high-stakes algorithmic decision-making.
arXiv Detail & Related papers (2023-07-17T20:59:50Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - On the Optimality of Batch Policy Optimization Algorithms [106.89498352537682]
Batch policy optimization considers leveraging existing data for policy construction before interacting with an environment.
We show that any confidence-adjusted index algorithm is minimax optimal, whether it be optimistic, pessimistic or neutral.
We introduce a new weighted-minimax criterion that considers the inherent difficulty of optimal value prediction.
arXiv Detail & Related papers (2021-04-06T05:23:20Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z) - Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds
Globally Optimal Policy [95.98698822755227]
We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
arXiv Detail & Related papers (2020-12-28T05:02:26Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged data.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.