Variance-Aware Off-Policy Evaluation with Linear Function Approximation
- URL: http://arxiv.org/abs/2106.11960v1
- Date: Tue, 22 Jun 2021 17:58:46 GMT
- Title: Variance-Aware Off-Policy Evaluation with Linear Function Approximation
- Authors: Yifei Min and Tianhao Wang and Dongruo Zhou and Quanquan Gu
- Abstract summary: We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
- Score: 85.75516599931632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the off-policy evaluation (OPE) problem in reinforcement learning
with linear function approximation, which aims to estimate the value function
of a target policy based on the offline data collected by a behavior policy. We
propose to incorporate the variance information of the value function to
improve the sample efficiency of OPE. More specifically, for time-inhomogeneous
episodic linear Markov decision processes (MDPs), we propose an algorithm,
VA-OPE, which uses the estimated variance of the value function to reweight the
Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a
tighter error bound than the best-known result. We also provide a fine-grained
characterization of the distribution shift between the behavior policy and the
target policy. Extensive numerical experiments corroborate our theory.
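The reweighting idea in the abstract can be illustrated with a small sketch. This is not the authors' VA-OPE implementation: it is a simplified variance-weighted Fitted Q-Iteration for a finite-horizon problem with linear features, and several details are assumptions made for illustration. The function names (`ridge`, `weighted_fqi`), the data layout, the variance floor of 1, and the reuse of the same data for the variance and value regressions are all hypothetical choices, not taken from the paper.

```python
import numpy as np

def ridge(Phi, y, weights, lam=1.0):
    """Weighted ridge regression:
    argmin_w sum_k weights[k] * (Phi[k] @ w - y[k])**2 + lam * ||w||^2."""
    WPhi = Phi * weights[:, None]
    A = WPhi.T @ Phi + lam * np.eye(Phi.shape[1])
    b = WPhi.T @ y
    return np.linalg.solve(A, b)

def weighted_fqi(data, phi_next_pi, H, d, lam=1.0, reweight=True):
    """Backward Fitted Q-Iteration with optional inverse-variance weights.

    data[h]        = (Phi, r): features of (s, a) with shape (K, d), rewards (K,)
    phi_next_pi[h] = (K, d) array holding E_{a' ~ pi}[phi(s', a')] for each
                     observed next state s' (assumed precomputed)
    Returns a list ws with ws[h] the weight vector of the step-h Q-function.
    """
    w_next = np.zeros(d)          # the value after the final step is zero
    ws = [None] * H
    for h in reversed(range(H)):
        Phi, r = data[h]
        v_next = phi_next_pi[h] @ w_next      # estimated V_{h+1}(s')
        if reweight:
            ones = np.ones(len(r))
            # Estimate first and second conditional moments of V_{h+1}(s')
            # by unweighted ridge regression, then form a variance estimate.
            m1 = Phi @ ridge(Phi, v_next, ones, lam)
            m2 = Phi @ ridge(Phi, v_next ** 2, ones, lam)
            sigma2 = np.maximum(1.0, m2 - m1 ** 2)   # floor is an assumption
            weights = 1.0 / sigma2
        else:
            weights = np.ones(len(r))
        # Reweighted Bellman residual: regress r + V_{h+1}(s') onto phi(s, a).
        ws[h] = ridge(Phi, r + v_next, weights, lam)
        w_next = ws[h]
    return ws

# Toy problem (hypothetical): two one-hot features, horizon 2, reward 1 for
# feature 0 and 0 for feature 1, and the target policy always moves to feature 0.
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
r = Phi[:, 0].copy()
nxt = np.tile(np.array([1.0, 0.0]), (4, 1))
ws = weighted_fqi([(Phi, r), (Phi, r)], [nxt, nxt], H=2, d=2, lam=1e-6)
```

Setting `reweight=False` recovers ordinary ridge-regression FQI, which makes it easy to compare the two estimators on the same data. In this deterministic toy problem the variance estimate sits at the floor, so both variants return the same weights.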
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
- Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z)
- Doubly-Robust Off-Policy Evaluation with Estimated Logging Policy [11.16777821381608]
We introduce a novel doubly-robust (DR) off-policy estimator for Markov decision processes, DRUnknown, designed for situations where both the logging policy and the value function are unknown.
The proposed estimator first estimates the logging policy and then estimates the value function model by minimizing the variance of the estimator while accounting for the effect of estimating the logging policy.
arXiv Detail & Related papers (2024-04-02T10:42:44Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
It considers the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)