Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
- URL: http://arxiv.org/abs/2001.04515v2
- Date: Sun, 20 Jun 2021 20:28:50 GMT
- Title: Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings
- Authors: C. Shi, S. Zhang, W. Lu and R. Song
- Abstract summary: We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning is a general technique that allows an agent to learn
an optimal policy and interact with an environment in sequential decision
making problems. The goodness of a policy is measured by its value function
starting from some initial state. The focus of this paper is to construct
confidence intervals (CIs) for a policy's value in infinite horizon settings
where the number of decision points diverges to infinity. We propose to model
the state-action value function (Q-function) associated with a policy using a
series/sieve method and derive its confidence interval. When the target policy
depends on the observed data as well, we propose a SequentiAl Value Evaluation
(SAVE) method to recursively update the estimated policy and its value
estimator. As long as either the number of trajectories or the number of
decision points diverges to infinity, we show that the proposed CI achieves
nominal coverage even in cases where the optimal policy is not unique.
Simulation studies are conducted to support our theoretical findings. We apply
the proposed method to a dataset from mobile health studies and find that
reinforcement learning algorithms could help improve patients' health status. A
Python implementation of the proposed procedure is available at
https://github.com/shengzhang37/SAVE.
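To make the series/sieve construction in the abstract concrete, the following is a minimal sketch of how a linear sieve approximation of the Q-function can yield a plug-in value estimate and a normal-theory confidence interval for a fixed target policy. This is not the authors' SAVE implementation (see the repository linked above); the polynomial basis, the data layout, and the sandwich variance formula are illustrative assumptions.

```python
# Minimal sketch: linear sieve Q-function + plug-in value CI for a fixed
# target policy. Illustrative only; not the authors' SAVE code.
import numpy as np
from scipy.stats import norm


def polynomial_basis(state, action, degree=2, num_actions=2):
    """Tensor-product sieve basis: polynomial features of the state,
    interacted with a one-hot encoding of a finite action."""
    s = np.asarray(state, dtype=float).ravel()
    powers = np.concatenate([[1.0]] + [s ** d for d in range(1, degree + 1)])
    phi = np.zeros(len(powers) * num_actions)
    phi[action * len(powers):(action + 1) * len(powers)] = powers
    return phi


def sieve_value_ci(transitions, target_policy, init_states,
                   gamma=0.9, alpha=0.05, **basis_kw):
    """transitions: iterable of (state, action, reward, next_state) tuples
    from one or more trajectories; target_policy maps a state to an action
    (deterministic, for simplicity). Returns (estimate, lower, upper) for
    the value of target_policy averaged over init_states."""
    transitions = list(transitions)
    n = len(transitions)
    Phi = np.array([polynomial_basis(s, a, **basis_kw)
                    for s, a, _, _ in transitions])
    Phi_next = np.array([polynomial_basis(s2, target_policy(s2), **basis_kw)
                         for _, _, _, s2 in transitions])
    rewards = np.array([r for _, _, r, _ in transitions], dtype=float)

    # Linear sieve Q(s, a) = phi(s, a)^T beta; estimate beta from the
    # projected Bellman (LSTD-type) estimating equation A beta = b.
    A = Phi.T @ (Phi - gamma * Phi_next) / n
    b = Phi.T @ rewards / n
    beta = np.linalg.solve(A, b)

    # Sandwich variance of beta based on the temporal-difference residuals.
    resid = rewards + gamma * Phi_next @ beta - Phi @ beta
    Omega = (Phi * (resid ** 2)[:, None]).T @ Phi / n
    A_inv = np.linalg.inv(A)
    cov_beta = A_inv @ Omega @ A_inv.T / n

    # Plug-in value at the initial-state distribution and a normal-theory CI.
    c = np.mean([polynomial_basis(s0, target_policy(s0), **basis_kw)
                 for s0 in init_states], axis=0)
    value = float(c @ beta)
    se = float(np.sqrt(c @ cov_beta @ c))
    z = norm.ppf(1.0 - alpha / 2.0)
    return value, value - z * se, value + z * se
```

The paper's SAVE procedure additionally covers target policies that are themselves estimated from the data, by recursively updating the estimated policy and its value estimator; that recursion is not shown in this sketch.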
Related papers
- Kernel Metric Learning for In-Sample Off-Policy Evaluation of Deterministic RL Policies [24.706986328622193]
We consider off-policy evaluation of deterministic target policies for reinforcement learning.
We learn the kernel metrics that minimize the overall mean squared error of the estimated temporal difference update vector of an action value function.
We derive the bias and variance of the estimation error due to this relaxation and provide analytic solutions for the optimal kernel metric.
arXiv Detail & Related papers (2024-05-29T06:17:33Z)
- Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z)
- Chaining Value Functions for Off-Policy Learning [22.54793586116019]
We discuss a novel family of off-policy prediction algorithms which are convergent by construction.
We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix.
Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
arXiv Detail & Related papers (2022-01-17T15:26:47Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration (a rough sketch of this reweighting idea appears after the related-papers list).
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Deeply-Debiased Off-Policy Interval Estimation [11.683223078990325]
Off-policy evaluation learns a target policy's value with a historical dataset generated by a different behavior policy.
Many applications would benefit significantly from having a confidence interval (CI) that quantifies the uncertainty of the point estimate.
We propose a novel procedure to construct an efficient, robust, and flexible CI on a target policy's value.
arXiv Detail & Related papers (2021-05-10T20:00:08Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
We consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
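As a side note on the VA-OPE entry above, here is a rough sketch of one variance-weighted regression step of Fitted Q-Iteration, illustrating the idea of reweighting the Bellman residual by an estimated variance. The linear model, the weight formula, and all variable names are assumptions for illustration, not the algorithm from that paper.

```python
# Rough sketch of one variance-weighted Fitted Q-Iteration update for
# off-policy evaluation. Illustrative only; not the VA-OPE algorithm.
import numpy as np


def weighted_fqi_step(Phi, rewards, Phi_next, beta_prev, var_estimate, gamma=0.99):
    """One weighted least-squares update: regress the Bellman target
    r + gamma * Q(s', pi(s')) on features Phi, with per-sample weights
    inversely proportional to an estimated variance of the target.
    Phi, Phi_next: (n, p) feature matrices at (s, a) and (s', pi(s'))."""
    targets = rewards + gamma * Phi_next @ beta_prev  # Bellman targets under previous Q
    weights = 1.0 / np.maximum(var_estimate, 1e-6)    # down-weight high-variance samples
    # Weighted normal equations: (Phi^T W Phi) beta = Phi^T W targets
    return np.linalg.solve(Phi.T @ (weights[:, None] * Phi),
                           Phi.T @ (weights * targets))
```

Down-weighting high-variance Bellman targets is a standard weighted least-squares device; the precise VA-OPE weighting and its guarantees are given in the cited paper.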