A Sharp Characterization of Linear Estimators for Offline Policy
Evaluation
- URL: http://arxiv.org/abs/2203.04236v1
- Date: Tue, 8 Mar 2022 17:52:57 GMT
- Title: A Sharp Characterization of Linear Estimators for Offline Policy
Evaluation
- Authors: Juan C. Perdomo, Akshay Krishnamurthy, Peter Bartlett, Sham Kakade
- Abstract summary: offline policy evaluation is a fundamental statistical problem in reinforcement learning.
We identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods.
Our results provide a complete picture of the behavior of linear estimators for offline policy evaluation.
- Score: 33.37672297925897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline policy evaluation is a fundamental statistical problem in
reinforcement learning that involves estimating the value function of some
decision-making policy given data collected by a potentially different policy.
In order to tackle problems with complex, high-dimensional observations, there
has been significant interest from theoreticians and practitioners alike in
understanding the possibility of function approximation in reinforcement
learning. Despite significant study, a sharp characterization of when we might
expect offline policy evaluation to be tractable, even in the simplest setting
of linear function approximation, has so far remained elusive, with a
surprising number of strong negative results recently appearing in the
literature.
In this work, we identify simple control-theoretic and linear-algebraic
conditions that are necessary and sufficient for classical methods, in
particular Fitted Q-iteration (FQI) and least squares temporal difference
learning (LSTD), to succeed at offline policy evaluation. Using this
characterization, we establish a precise hierarchy of regimes under which these
estimators succeed. We prove that LSTD works under strictly weaker conditions
than FQI. Furthermore, we establish that if a problem is not solvable via LSTD,
then it cannot be solved by a broad class of linear estimators, even in the
limit of infinite data. Taken together, our results provide a complete picture
of the behavior of linear estimators for offline policy evaluation (OPE), unify
previously disparate analyses of canonical algorithms, and provide
significantly sharper notions of the underlying statistical complexity of OPE.
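As a concrete reference point, here is a minimal numpy sketch (not from the paper) of the two classical estimators the abstract discusses, LSTD and FQI, applied to offline policy evaluation with linear function approximation. It assumes a fixed feature map, a target policy whose actions are already baked into the next-state and initial-state features, and a small ridge term for numerical stability; the function names and arguments are illustrative.

```python
import numpy as np

def lstd_ope(phi, phi_next, rewards, phi_init, gamma=0.99, ridge=1e-8):
    """LSTD estimate of the target policy's value from off-policy data.

    phi      : (n, d) features of logged (state, action) pairs
    phi_next : (n, d) features of (next state, target-policy action) pairs
    rewards  : (n,)   observed rewards
    phi_init : (m, d) features of (initial state, target-policy action) pairs
    """
    n, d = phi.shape
    # Empirical LSTD system A w = b, with A = E[phi (phi - gamma phi')^T], b = E[phi r].
    A = phi.T @ (phi - gamma * phi_next) / n
    b = phi.T @ rewards / n
    w = np.linalg.solve(A + ridge * np.eye(d), b)  # ridge only for numerical stability
    # Value estimate: average predicted Q-value over the initial state distribution.
    return float(np.mean(phi_init @ w))

def fqi_ope(phi, phi_next, rewards, phi_init, gamma=0.99, iters=200, ridge=1e-8):
    """Fitted Q-iteration with linear function approximation for the same estimand."""
    n, d = phi.shape
    gram_inv = np.linalg.inv(phi.T @ phi / n + ridge * np.eye(d))
    w = np.zeros(d)
    for _ in range(iters):
        # Regress the Bellman targets r + gamma * Q_w(s', pi(s')) onto the features.
        targets = rewards + gamma * (phi_next @ w)
        w = gram_inv @ (phi.T @ targets / n)
    return float(np.mean(phi_init @ w))
```

Whether these computations behave well depends on the invertibility and conditioning of the matrix solved in LSTD and on the stability of the linear update map iterated in FQI, which is the kind of linear-algebraic and control-theoretic structure the paper's characterization makes precise.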
Related papers
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm that avoids overfitting to the logged data through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z) - Near-optimal Offline Reinforcement Learning with Linear Representation:
Leveraging Variance Information with Pessimism [65.46524775457928]
Offline reinforcement learning seeks to utilize offline/historical data to optimize sequential decision-making strategies.
We study the statistical limits of offline reinforcement learning with linear model representations.
arXiv Detail & Related papers (2022-03-11T09:00:12Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards
Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Offline Reinforcement Learning: Fundamental Barriers for Value Function
Approximation [74.3002974673248]
We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data.
Offline RL is becoming increasingly relevant in practice because online data collection is poorly suited to safety-critical domains.
Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning.
arXiv Detail & Related papers (2021-11-21T23:22:37Z) - Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and
Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z) - What are the Statistical Limits of Offline RL with Linear Function
Approximation? [70.33301077240763]
Offline reinforcement learning seeks to utilize offline (observational) data to guide the learning of sequential decision making strategies.
This work focuses on the basic question of what are necessary representational and distributional conditions that permit provable sample-efficient offline reinforcement learning.
arXiv Detail & Related papers (2020-10-22T17:32:13Z) - Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates the bounds of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)