Provably Good Batch Reinforcement Learning Without Great Exploration
- URL: http://arxiv.org/abs/2007.08202v2
- Date: Wed, 22 Jul 2020 08:48:10 GMT
- Title: Provably Good Batch Reinforcement Learning Without Great Exploration
- Authors: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
- Abstract summary: Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation backups, taking a more conservative update, yields much stronger guarantees.
- Score: 51.51462608429621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Batch reinforcement learning (RL) is important for applying RL algorithms to many
high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in
large domains is challenging: a new decision policy may visit states and
actions outside the support of the batch data, and function approximation and
optimization with limited samples can further increase the potential of
learning policies with overly optimistic estimates of their future performance.
Recent algorithms have shown promise but can still be overly optimistic in
their expected outcomes. Theoretical work that provides strong guarantees on
the performance of the output policy relies on a strong concentrability
assumption, which makes it unsuitable for cases where the ratio between the
state-action distributions of the behavior policy and some candidate policies is
large. This is because in the traditional analysis, the error bound scales up
with this ratio. We show that a small modification to the Bellman optimality and
evaluation backups, taking a more conservative update, can give much stronger
guarantees. In certain settings, the resulting algorithms can find the approximately best policy
within the state-action space explored by the batch data, without requiring a
priori assumptions of concentrability. We highlight the necessity of our
conservative update and the limitations of previous algorithms and analyses with
illustrative MDP examples, and present an empirical comparison of our
algorithm and other state-of-the-art batch RL baselines in standard benchmarks.
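To make the conservative backup concrete, below is a minimal tabular sketch in which value is only backed up through state-action pairs that the batch data covers well, and everything else is pinned to a pessimistic fallback value. The count threshold b, the empirical tabular model, and the zero fallback are illustrative assumptions; the paper's actual setting includes function approximation, so treat this as a sketch of the conservative-update idea rather than the paper's exact method.

```python
import numpy as np

def conservative_q_iteration(transitions, n_states, n_actions,
                             gamma=0.99, b=5, n_iters=200, v_min=0.0):
    """Tabular value iteration from batch data with a conservative backup.

    transitions: list of (s, a, r, s_next) tuples logged by the behavior policy.
    b: minimum visit count for (s, a) to be considered supported by the data.
    v_min: pessimistic fallback value for unsupported state-action pairs
           (assumes rewards are nonnegative, so 0 is a valid lower bound).
    """
    counts = np.zeros((n_states, n_actions))
    for s, a, _, _ in transitions:
        counts[s, a] += 1
    supported = counts >= b  # filter: only back up through well-covered pairs

    # Empirical model estimated from the batch data.
    r_sum = np.zeros((n_states, n_actions))
    p_counts = np.zeros((n_states, n_actions, n_states))
    for s, a, r, s_next in transitions:
        r_sum[s, a] += r
        p_counts[s, a, s_next] += 1
    r_hat = r_sum / np.maximum(counts, 1)
    p_hat = p_counts / np.maximum(counts, 1)[:, :, None]

    q = np.full((n_states, n_actions), v_min)
    for _ in range(n_iters):
        # Unsupported actions keep the pessimistic value v_min, so the max
        # below never propagates optimism through unobserved regions.
        v = np.where(supported, q, v_min).max(axis=1)
        q = r_hat + gamma * p_hat @ v
        q = np.where(supported, q, v_min)
    return q, supported
```

The only change relative to standard value iteration is that the max in the backup ranges over supported actions and unsupported pairs keep a pessimistic value, so optimism cannot propagate through regions the batch data never explored.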
Related papers
- Beyond Expected Return: Accounting for Policy Reproducibility when
Evaluating Reinforcement Learning Algorithms [9.649114720478872]
Many applications in Reinforcement Learning (RL) have noise or stochasticity present in the environment.
These uncertainties lead the exact same policy to perform differently from one roll-out to another.
Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution.
Our work defines this spread as policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications.
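As a concrete illustration of reporting spread alongside the expected return, one could summarize repeated roll-outs of a single policy with both a mean and a dispersion statistic; the particular statistics below (standard deviation and interquartile range) are assumptions for illustration, not necessarily the metric proposed in the paper.

```python
import numpy as np

def summarize_rollouts(returns):
    """Summarize per-episode returns from repeated roll-outs of one policy.

    returns: 1-D array of episodic returns for the *same* policy.
    Reports the expected return plus spread statistics, since two policies with
    equal means can differ greatly in how reproducible their performance is.
    """
    returns = np.asarray(returns, dtype=float)
    return {
        "expected_return": returns.mean(),
        "std": returns.std(ddof=1),
        "interquartile_range": np.percentile(returns, 75) - np.percentile(returns, 25),
    }

# Example: same mean, very different reproducibility.
print(summarize_rollouts([10.0, 10.1, 9.9, 10.0]))
print(summarize_rollouts([0.0, 20.0, 1.0, 19.0]))
```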
arXiv Detail & Related papers (2023-12-12T11:22:31Z)
- Importance-Weighted Offline Learning Done Right [16.4989952150404]
We study the problem of offline policy optimization in contextual bandit problems.
The goal is to learn a near-optimal policy based on a dataset of decision data collected by a suboptimal behavior policy.
We show that a simple alternative approach based on the "implicit exploration" estimator of Neu (2015) yields performance guarantees that are superior in nearly all possible terms to all previous results.
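For context, the implicit-exploration (IX) estimator smooths the importance weight by adding a constant to the logged propensity, which tames the variance of plain inverse-propensity weighting. The sketch below, for a context-free bandit with logged propensities and a smoothing parameter gamma, is a minimal illustration of that estimator, not the paper's full algorithm.

```python
import numpy as np

def ix_reward_estimates(logged_actions, logged_rewards, logged_propensities,
                        n_actions, gamma=0.1):
    """Implicit-exploration (IX) style reward estimates for each action.

    Plain importance weighting divides by the behavior propensity pi_b(a),
    which blows up when pi_b(a) is tiny; IX divides by pi_b(a) + gamma,
    trading a small downward bias for much lower variance and tail risk.
    """
    est = np.zeros(n_actions)
    for a, r, p in zip(logged_actions, logged_rewards, logged_propensities):
        est[a] += r / (p + gamma)   # IX-weighted reward for the taken action
    return est / len(logged_actions)  # per-action estimated expected reward

# A learned policy could then pick the action with the best IX-based estimate;
# this sketch ignores contexts for brevity.
```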
arXiv Detail & Related papers (2023-09-27T16:42:10Z)
- STEEL: Singularity-aware Reinforcement Learning [14.424199399139804]
Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy.
We propose a new batch RL algorithm that allows for singularity in both the state and action spaces.
By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm.
arXiv Detail & Related papers (2023-01-30T18:29:35Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
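A minimal sketch of the lower-confidence-bound idea for a discrete contextual bandit is below; the Hoeffding-style width and the per-context tabulation are illustrative assumptions rather than the paper's estimator.

```python
import numpy as np
from collections import defaultdict

def lcb_policy(logged_data, n_actions, delta=0.05):
    """Pick, per context, the action maximizing a lower confidence bound.

    logged_data: iterable of (context, action, reward) with rewards in [0, 1].
    Rarely explored actions get wide intervals, hence low LCBs, so the learned
    rule is pessimistic about actions the data says little about.
    """
    sums = defaultdict(lambda: np.zeros(n_actions))
    counts = defaultdict(lambda: np.zeros(n_actions))
    for x, a, r in logged_data:
        sums[x][a] += r
        counts[x][a] += 1

    policy = {}
    for x in counts:
        n = np.maximum(counts[x], 1)
        mean = sums[x] / n
        width = np.sqrt(np.log(2 * n_actions / delta) / (2 * n))
        lcb = np.where(counts[x] > 0, mean - width, -np.inf)  # unseen -> worst
        policy[x] = int(np.argmax(lcb))
    return policy
```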
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces the divergence of the learned policy from the behavior policy, and a value constraint that discourages overly optimistic estimates.
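A schematic of how two such penalties could enter an actor objective is below; the log-ratio divergence term, the ensemble-disagreement value penalty, and the coefficient names are illustrative assumptions, not the paper's exact formulation.

```python
def doubly_constrained_actor_loss(q_value, log_pi_a, log_beta_a,
                                  q_ensemble_std, alpha_policy, alpha_value):
    """Schematic batch-RL actor loss with two penalties (scalar inputs).

    q_value:        critic's estimate Q(s, a) for the policy's action a
    log_pi_a:       log pi(a|s) under the learned policy
    log_beta_a:     log beta(a|s) under (an estimate of) the behavior policy
    q_ensemble_std: disagreement of a critic ensemble at (s, a), used here as
                    a proxy for how uncertain/optimistic the value estimate is
    """
    policy_penalty = log_pi_a - log_beta_a   # sample-based KL-style divergence term
    value_penalty = q_ensemble_std           # discourage overly optimistic values
    # Maximize value minus the two constraint penalties (negate for a loss).
    return -(q_value - alpha_policy * policy_penalty - alpha_value * value_penalty)

# Example with made-up numbers: a large deviation from the behavior policy or
# high ensemble disagreement drives the loss up.
print(doubly_constrained_actor_loss(q_value=1.2, log_pi_a=-0.5, log_beta_a=-2.0,
                                    q_ensemble_std=0.8, alpha_policy=0.1, alpha_value=0.5))
```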
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
It considers the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
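One standard way to carry out off-policy evaluation with linear function approximation is fitted Q evaluation by ridge regression onto bootstrapped targets; the sketch below illustrates that generic approach and is not necessarily the estimator analyzed in the paper.

```python
import numpy as np

def linear_fqe(batch, phi, target_policy, gamma=0.99, n_iters=100, reg=1e-3):
    """Fitted Q evaluation with a linear model Q(s, a) ~ phi(s, a)^T w.

    batch: list of (s, a, r, s_next) transitions from the behavior policy.
    phi:   feature map phi(s, a) -> np.ndarray of dimension d.
    target_policy: function s -> action chosen by the policy being evaluated.
    Returns the weight vector w; the value of the target policy from a start
    state s0 is then phi(s0, target_policy(s0)) @ w.
    """
    X = np.stack([phi(s, a) for s, a, _, _ in batch])              # (n, d)
    X_next = np.stack([phi(s2, target_policy(s2)) for _, _, _, s2 in batch])
    r = np.array([t[2] for t in batch], dtype=float)
    d = X.shape[1]
    w = np.zeros(d)
    A = X.T @ X + reg * np.eye(d)                                   # ridge normal equations
    for _ in range(n_iters):
        y = r + gamma * X_next @ w                                  # bootstrapped TD targets
        w = np.linalg.solve(A, X.T @ y)
    return w
```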
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
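To make the value-tracking idea concrete, here is a minimal Kalman update for the parameters of a linear value function, treating an observed return as a noisy scalar measurement of the current value estimate. This simplified linear-Gaussian sketch is an assumption for illustration and is not the paper's KOVA formulation.

```python
import numpy as np

def kalman_value_update(w, P, phi_s, observed_return, obs_noise=1.0):
    """One Kalman update for a linear value model V(s) ~ phi(s)^T w.

    Treats the parameter vector w as the hidden state (with covariance P) and
    an observed Monte Carlo return from state s as a noisy scalar measurement.
    Returns the updated mean and covariance of the parameters.
    """
    phi_s = np.asarray(phi_s, dtype=float)
    innovation = observed_return - phi_s @ w       # measurement residual
    s_var = phi_s @ P @ phi_s + obs_noise          # innovation variance (scalar)
    k_gain = (P @ phi_s) / s_var                   # Kalman gain (vector)
    w_new = w + k_gain * innovation
    P_new = P - np.outer(k_gain, phi_s) @ P        # posterior parameter covariance
    return w_new, P_new

# P can be initialized to, e.g., np.eye(d); it shrinks as more returns are
# observed, so updates weight new data against current parameter uncertainty.
```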
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
- BRPO: Batch Residual Policy Optimization [79.53696635382592]
In batch reinforcement learning, one often constrains a learned policy to be close to the behavior (data-generating) policy.
We propose residual policies, where the allowable deviation of the learned policy is state-action-dependent.
We derive a new RL method, BRPO, which learns both the policy and the allowable deviation that jointly maximize a lower bound on policy performance.
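One illustrative way to realize a state-action-dependent allowable deviation is to mix the behavior policy with a learned correction under per-action weights; the mixture below is a hedged sketch of that general idea, not BRPO's exact construction.

```python
import numpy as np

def residual_policy_probs(behavior_probs, residual_logits, deviation_weights):
    """Blend a behavior policy with a learned residual, per state and action.

    behavior_probs:    beta(a|s), shape (n_actions,)
    residual_logits:   learned correction scores, shape (n_actions,)
    deviation_weights: allowable deviation lambda(s, a) in [0, 1], shape (n_actions,)

    Where lambda is 0 the result stays on the behavior policy; where it is 1
    the residual fully reweights that action. Probabilities are renormalized.
    """
    residual = np.exp(residual_logits - residual_logits.max())
    residual = residual / residual.sum()
    mixed = (1.0 - deviation_weights) * behavior_probs + deviation_weights * residual
    return mixed / mixed.sum()

# Example: the learned policy may only deviate on the last two actions.
print(residual_policy_probs(np.array([0.5, 0.3, 0.1, 0.1]),
                            np.array([0.0, 0.0, 2.0, 1.0]),
                            np.array([0.0, 0.0, 0.8, 0.8])))
```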
arXiv Detail & Related papers (2020-02-08T01:59:33Z)