Beyond the Return: Off-policy Function Estimation under User-specified
Error-measuring Distributions
- URL: http://arxiv.org/abs/2210.15543v1
- Date: Thu, 27 Oct 2022 15:34:17 GMT
- Title: Beyond the Return: Off-policy Function Estimation under User-specified
Error-measuring Distributions
- Authors: Audrey Huang, Nan Jiang
- Abstract summary: Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function.
We provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the marginalized importance sampling objectives.
- Score: 8.881195152638986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-policy evaluation often refers to two related tasks: estimating the
expected return of a policy and estimating its value function (or other
functions of interest, such as density ratios). While recent works on
marginalized importance sampling (MIS) show that the former can enjoy provable
guarantees under realizable function approximation, the latter is only known to
be feasible under much stronger assumptions such as prohibitively expressive
discriminators. In this work, we provide guarantees for off-policy function
estimation under only realizability, by imposing proper regularization on the
MIS objectives. Compared to commonly used regularization in MIS, our
regularizer is much more flexible and can account for an arbitrary
user-specified distribution, under which the learned function will be close to
the groundtruth. We provide exact characterization of the optimal dual solution
that needs to be realized by the discriminator class, which determines the
data-coverage assumption in the case of value-function learning. As another
surprising observation, the regularizer can be altered to relax the
data-coverage requirement, and completely eliminate it in the ideal case with
strong side information.
Related papers
- Beyond Expected Return: Accounting for Policy Reproducibility when
Evaluating Reinforcement Learning Algorithms [9.649114720478872]
Many applications in Reinforcement Learning (RL) have noise ority present in the environment.
These uncertainties lead the exact same policy to perform differently, from one roll-out to another.
Common evaluation procedures in RL summarise the consequent return distributions using solely the expected return, which does not account for the spread of the distribution.
Our work defines this spread as the policy: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications.
arXiv Detail & Related papers (2023-12-12T11:22:31Z) - Provable Offline Preference-Based Reinforcement Learning [95.00042541409901]
We investigate the problem of offline Preference-based Reinforcement Learning (PbRL) with human feedback.
We consider the general reward setting where the reward can be defined over the whole trajectory.
We introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability.
arXiv Detail & Related papers (2023-05-24T07:11:26Z) - Offline Reinforcement Learning with Additional Covering Distributions [0.0]
We study learning optimal policies from a logged dataset, i.e., offline RL, with function approximation.
We show that sample-efficient offline RL for general MDPs is possible with only a partial coverage dataset and weak realizable function classes.
arXiv Detail & Related papers (2023-05-22T03:31:03Z) - On the Efficacy of Generalization Error Prediction Scoring Functions [33.24980750651318]
Generalization error predictors (GEPs) aim to predict model performance on unseen distributions by deriving dataset-level error estimates from sample-level scores.
We rigorously study the effectiveness of popular scoring functions (confidence, local manifold smoothness, model agreement) independent of mechanism choice.
arXiv Detail & Related papers (2023-03-23T18:08:44Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and
Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z) - Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z) - GenDICE: Generalized Offline Estimation of Stationary Values [108.17309783125398]
We show that effective estimation can still be achieved in important applications.
Our approach is based on estimating a ratio that corrects for the discrepancy between the stationary and empirical distributions.
The resulting algorithm, GenDICE, is straightforward and effective.
arXiv Detail & Related papers (2020-02-21T00:27:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.