Online Bootstrap Inference For Policy Evaluation in Reinforcement
Learning
- URL: http://arxiv.org/abs/2108.03706v1
- Date: Sun, 8 Aug 2021 18:26:35 GMT
- Title: Online Bootstrap Inference For Policy Evaluation in Reinforcement
Learning
- Authors: Pratik Ramprasad, Yuantong Li, Zhuoran Yang, Zhaoran Wang, Will Wei
Sun, Guang Cheng
- Abstract summary: The recent emergence of reinforcement learning has created a demand for robust statistical inference methods.
Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations.
The online bootstrap is a flexible and efficient approach for statistical inference in linear stochastic approximation algorithms, but its efficacy in settings involving Markov noise has yet to be explored.
- Score: 90.59143158534849
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent emergence of reinforcement learning has created a demand for
robust statistical inference methods for the parameter estimates computed using
these algorithms. Existing methods for statistical inference in online learning
are restricted to settings involving independently sampled observations, while
existing statistical inference methods in reinforcement learning (RL) are
limited to the batch setting. The online bootstrap is a flexible and efficient
approach for statistical inference in linear stochastic approximation
algorithms, but its efficacy in settings involving Markov noise, such as RL,
has yet to be explored. In this paper, we study the use of the online bootstrap
method for statistical inference in RL. In particular, we focus on the temporal
difference (TD) learning and Gradient TD (GTD) learning algorithms, which are
themselves special instances of linear stochastic approximation under Markov
noise. The method is shown to be distributionally consistent for statistical
inference in policy evaluation, and numerical experiments are included to
demonstrate the effectiveness of this algorithm at statistical inference tasks
across a range of real RL environments.
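To make the mechanism concrete, the following is a minimal, illustrative sketch of an online multiplier bootstrap wrapped around TD(0) with linear function approximation; it is not the authors' implementation. Each bootstrap replicate reuses the same observed transition but scales its TD update by a random mean-one weight, and the spread of the replicates yields confidence intervals for the value-function parameters. The transition generator, step-size schedule, and exponential multiplier weights are assumptions made for illustration; the GTD variant would perturb its two coupled updates analogously.

```python
import numpy as np

def online_bootstrap_td(transitions, d, gamma=0.99, n_boot=200,
                        step_size=lambda t: 1.0 / (t + 100) ** 0.75,
                        seed=0):
    """Sketch: TD(0) with linear features V(s) = phi(s) @ theta, plus an
    online multiplier bootstrap run alongside the point estimate.

    `transitions` yields (phi_s, r, phi_next) tuples obtained by following
    the target policy, so consecutive tuples carry Markov noise.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)                 # point estimate
    theta_b = np.zeros((n_boot, d))     # bootstrap replicates

    for t, (phi_s, r, phi_next) in enumerate(transitions):
        eta = step_size(t)

        # Standard TD(0) update on the point estimate.
        delta = r + gamma * phi_next @ theta - phi_s @ theta
        theta = theta + eta * delta * phi_s

        # Each replicate applies the same update scaled by a random
        # positive weight with mean one (Exp(1) here -- an illustrative
        # choice of multiplier distribution).
        w = rng.standard_exponential(size=n_boot)
        delta_b = r + gamma * theta_b @ phi_next - theta_b @ phi_s
        theta_b = theta_b + eta * (w * delta_b)[:, None] * phi_s[None, :]

    return theta, theta_b


def percentile_ci(theta_b, level=0.95):
    """Per-coordinate percentile interval from the bootstrap replicates."""
    return np.quantile(theta_b, [(1 - level) / 2, (1 + level) / 2], axis=0)
```

Because the replicates are updated in the same pass over the data as the point estimate, the per-step cost grows only linearly in the number of replicates and no transitions need to be stored.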
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
arXiv Detail & Related papers (2024-10-21T15:34:44Z)
- Online Estimation and Inference for Robust Policy Evaluation in
Reinforcement Learning [7.875680651592574]
We develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation.
This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation.
arXiv Detail & Related papers (2023-10-04T04:57:35Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Finite-Time Analysis of Temporal Difference Learning: Discrete-Time
Linear System Perspective [3.5823366350053325]
TD-learning is a fundamental algorithm in the field of reinforcement learning (RL).
Recent research has uncovered guarantees concerning its statistical efficiency by developing finite-time error bounds.
arXiv Detail & Related papers (2022-04-22T03:21:30Z)
- Fast and Robust Online Inference with Stochastic Gradient Descent via
Random Scaling [0.9806910643086042]
We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent algorithms.
Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem.
arXiv Detail & Related papers (2021-06-06T15:38:37Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- DEALIO: Data-Efficient Adversarial Learning for Imitation from
Observation [57.358212277226315]
In imitation learning from observation (IfO), a learning agent seeks to imitate a demonstrating agent using only observations of the demonstrated behavior, without access to the control signals generated by the demonstrator.
Recent methods based on adversarial imitation learning have led to state-of-the-art performance on IfO problems, but they typically suffer from high sample complexity due to a reliance on data-inefficient, model-free reinforcement learning algorithms.
This issue makes them impractical to deploy in real-world settings, where gathering samples can incur high costs in terms of time, energy, and risk.
We propose a more data-efficient IfO algorithm.
arXiv Detail & Related papers (2021-03-31T23:46:32Z)
- Bootstrapping Statistical Inference for Off-Policy Evaluation [43.79456564713911]
We study the use of bootstrapping in off-policy evaluation (OPE).
We propose a bootstrapping FQE method for inferring the distribution of the policy evaluation error and show that this method is efficient and consistent for off-policy statistical inference.
We evaluate the bootstrapping method in classical RL environments for confidence interval estimation, estimating the variance of an off-policy evaluator, and estimating the correlation between multiple off-policy evaluators (see the sketch after this list).
arXiv Detail & Related papers (2021-02-06T16:45:33Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
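The "Bootstrapping Statistical Inference for Off-Policy Evaluation" entry above describes a batch counterpart to the online scheme sketched earlier. The sketch below is a hypothetical illustration of trajectory-level resampling around a generic off-policy value estimator; `value_estimator` is a placeholder for an FQE-style routine and is not that paper's code.

```python
import numpy as np

def bootstrap_ope_ci(trajectories, value_estimator, n_boot=500,
                     level=0.95, seed=0):
    """Resample whole trajectories with replacement, re-run an off-policy
    value estimator on each resample, and report a percentile confidence
    interval plus the bootstrap variance of the estimate.
    """
    rng = np.random.default_rng(seed)
    n = len(trajectories)
    point = value_estimator(trajectories)          # e.g. an FQE routine
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)           # trajectory-level resample
        replicates[b] = value_estimator([trajectories[i] for i in idx])
    lo, hi = np.quantile(replicates, [(1 - level) / 2, (1 + level) / 2])
    return point, (lo, hi), replicates.var(ddof=1)
```

Resampling at the level of whole trajectories, rather than individual transitions, keeps the within-trajectory Markov dependence intact, which is the usual reason this flavour of bootstrap is preferred in off-policy evaluation.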