Online Estimation and Inference for Robust Policy Evaluation in
Reinforcement Learning
- URL: http://arxiv.org/abs/2310.02581v1
- Date: Wed, 4 Oct 2023 04:57:35 GMT
- Title: Online Estimation and Inference for Robust Policy Evaluation in
Reinforcement Learning
- Authors: Weidong Liu, Jiyuan Tu, Yichen Zhang, Xi Chen
- Abstract summary: We develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation.
This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation.
- Score: 7.875680651592574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, reinforcement learning has gained prominence in modern statistics,
with policy evaluation being a key component. Unlike traditional machine
learning literature on this topic, our work places emphasis on statistical
inference for the parameter estimates computed using reinforcement learning
algorithms. While most existing analyses assume random rewards to follow
standard distributions, limiting their applicability, we embrace the concept of
robust statistics in reinforcement learning by simultaneously addressing issues
of outlier contamination and heavy-tailed rewards within a unified framework.
In this paper, we develop an online robust policy evaluation procedure, and
establish the limiting distribution of our estimator, based on its Bahadur
representation. Furthermore, we develop a fully-online procedure to efficiently
conduct statistical inference based on the asymptotic distribution. This paper
bridges the gap between robust statistics and statistical inference in
reinforcement learning, offering a more versatile and reliable approach to
policy evaluation. Finally, we validate the efficacy of our algorithm through
numerical experiments conducted in real-world reinforcement learning
environments.
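The abstract does not spell out the update rule, but the kind of procedure it describes, an online, outlier-robust value update with averaged iterates, can be sketched as follows. Everything below (the Huber-clipped TD residual, the feature map `phi`, the step-size schedule, and the crude running covariance) is an illustrative assumption, not the authors' exact construction.

```python
import numpy as np

def huber_score(u, tau):
    """Huber score (clipped identity): u for |u| <= tau, tau*sign(u) otherwise."""
    return np.clip(u, -tau, tau)

def robust_online_policy_eval(stream, phi, dim, gamma=0.95, tau=5.0, lr0=1.0):
    """Online robust policy evaluation with linear features (illustrative sketch).

    `stream` yields transitions (s, r, s_next) generated under the target policy.
    The TD residual is passed through a Huber score to damp heavy-tailed or
    contaminated rewards; iterates are averaged online.
    """
    theta = np.zeros(dim)          # current iterate
    theta_bar = np.zeros(dim)      # running average of iterates (the reported estimator)
    G = np.zeros((dim, dim))       # running outer-product of updates (crude variance proxy)
    for t, (s, r, s_next) in enumerate(stream, start=1):
        x, x_next = phi(s), phi(s_next)
        resid = r + gamma * x_next @ theta - x @ theta   # TD residual
        g = huber_score(resid, tau) * x                  # robustified semi-gradient
        lr = lr0 / t**0.6                                # slowly decaying step size
        theta = theta + lr * g
        theta_bar = theta_bar + (theta - theta_bar) / t  # online Polyak-style averaging
        G = G + (np.outer(g, g) - G) / t
    return theta_bar, G
```

The averaged iterate `theta_bar` plays the role of the estimator whose limiting distribution the paper characterizes; the running matrix `G` merely hints at how a fully-online procedure can accumulate the quantities needed for inference without revisiting past data.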
Related papers
- Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
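To make the inferential target concrete, here is a minimal sketch of a standard plug-in (sandwich) confidence interval for an averaged TD estimate with linear features; the batch estimators of the matrices A and S and the helper names below are assumptions for illustration, not code from the cited paper.

```python
import numpy as np

def td_plug_in_ci(transitions, phi, theta_bar, s_query, gamma=0.95, z=1.96):
    """Plug-in confidence interval for the value estimate phi(s_query)' theta_bar.

    Uses the sandwich form Sigma = A^{-1} S A^{-T} for averaged TD iterates,
    with A = E[x (x - gamma x')'] and S = E[delta^2 x x'] estimated from
    `transitions`, a list of (s, r, s_next) tuples.
    """
    n = len(transitions)
    dim = len(theta_bar)
    A = np.zeros((dim, dim))
    S = np.zeros((dim, dim))
    for s, r, s_next in transitions:
        x, x_next = phi(s), phi(s_next)
        delta = r + gamma * x_next @ theta_bar - x @ theta_bar
        A += np.outer(x, x - gamma * x_next) / n
        S += delta**2 * np.outer(x, x) / n
    A_inv = np.linalg.inv(A)
    Sigma = A_inv @ S @ A_inv.T
    xq = phi(s_query)
    est = xq @ theta_bar
    se = np.sqrt(xq @ Sigma @ xq / n)
    return est - z * se, est + z * se
```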
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Positivity-free Policy Learning with Observational Data [8.293758599118618]
This study introduces a novel positivity-free (stochastic) policy learning framework.
We propose incremental propensity score policies to adjust propensity score values instead of assigning fixed values to treatments.
This paper provides a thorough exploration of the theoretical guarantees associated with policy learning and validates the proposed framework's finite-sample performance.
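Incremental propensity score policies are usually defined by multiplying the odds of treatment by a user-chosen factor delta rather than fixing the treatment deterministically; a minimal sketch of that construction follows (the exact variant used in the cited paper may differ).

```python
def incremental_policy(pi, delta):
    """Shift the odds of treatment by a factor `delta`.

    pi:    baseline propensity score P(A=1 | X) in (0, 1)
    delta: multiplicative shift of the treatment odds (delta > 0)

    Returns the treatment probability under the incremental intervention:
        q = delta * pi / (delta * pi + 1 - pi)
    delta = 1 leaves the observed propensities unchanged; delta > 1 makes
    treatment more likely without requiring a hard positivity condition.
    """
    return delta * pi / (delta * pi + 1 - pi)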
arXiv Detail & Related papers (2023-10-10T19:47:27Z) - Towards Theoretical Understanding of Data-Driven Policy Refinement [0.0]
This paper presents an approach for data-driven policy refinement in reinforcement learning, specifically designed for safety-critical applications.
Our principal contribution lies in the mathematical formulation of this data-driven policy refinement concept.
We present a series of theorems elucidating key theoretical properties of our approach, including convergence, robustness bounds, generalization error, and resilience to model mismatch.
arXiv Detail & Related papers (2023-05-11T13:36:21Z) - Off-policy Reinforcement Learning with Optimistic Exploration and
Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
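As a rough illustration of the first idea (not the cited authors' implementation), the optimistic target driving the exploration policy can be taken as an ensemble mean plus a scaled ensemble spread; the ensemble size and the coefficient `beta` below are assumptions.

```python
import numpy as np

def optimistic_target(q_values, beta=1.0):
    """Approximate upper confidence bound over an ensemble of critic estimates.

    q_values: array of shape (n_critics,) with ensemble Q-estimates for one (s, a)
    beta:     optimism coefficient; larger values favour exploration
    """
    return q_values.mean() + beta * q_values.std()
```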
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Online Bootstrap Inference For Policy Evaluation in Reinforcement
Learning [90.59143158534849]
The recent emergence of reinforcement learning has created a demand for robust statistical inference methods.
Existing methods for statistical inference in online learning are restricted to settings involving independently sampled observations.
The online bootstrap is a flexible and efficient approach for statistical inference in linear approximation algorithms, but its efficacy in settings involving Markov noise has yet to be explored.
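The online bootstrap referred to here maintains a set of randomly re-weighted copies of the iterate alongside the main one, so that confidence intervals come from the spread of the copies. Below is a minimal sketch for a generic linear stochastic-approximation update; the `update` callable, the multiplier-weight distribution, and the step-size schedule are illustrative assumptions.

```python
import numpy as np

def online_multiplier_bootstrap(stream, update, dim, n_boot=200, lr0=0.5, decay=0.6, seed=0):
    """Run one main iterate plus `n_boot` randomly re-weighted copies online.

    `update(theta, obs)` returns the stochastic (semi-)gradient at `theta` for
    one observation; each bootstrap copy multiplies its own gradient by an
    i.i.d. positive weight with mean 1 (here Exponential(1)).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    boot = np.zeros((n_boot, dim))
    for t, obs in enumerate(stream, start=1):
        lr = lr0 / t**decay
        theta = theta + lr * update(theta, obs)
        w = rng.exponential(1.0, size=n_boot)                      # multiplier weights, mean 1
        grads = np.array([update(b, obs) for b in boot])           # gradient at each copy
        boot = boot + lr * w[:, None] * grads
    return theta, boot   # column-wise quantiles of `boot` give confidence intervals
```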
arXiv Detail & Related papers (2021-08-08T18:26:35Z) - Bootstrapping Statistical Inference for Off-Policy Evaluation [43.79456564713911]
We study the use of bootstrapping in off-policy evaluation (OPE).
We propose a bootstrapping FQE (fitted Q-evaluation) method for inferring the distribution of the policy evaluation error and show that this method is efficient and consistent for off-policy statistical inference.
We evaluate the bootstrapping method in classical RL environments for confidence interval estimation, for estimating the variance of an off-policy evaluator, and for estimating the correlation between multiple off-policy evaluators.
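At a high level, the procedure resamples whole logged trajectories, reruns FQE on each resample, and reads off percentile intervals; a schematic sketch, with `run_fqe` standing in for any fitted Q-evaluation implementation.

```python
import numpy as np

def bootstrap_fqe_ci(episodes, run_fqe, n_boot=200, alpha=0.05, seed=0):
    """Percentile confidence interval for an FQE value estimate.

    episodes: list of logged trajectories
    run_fqe:  callable mapping a list of trajectories to a scalar value estimate
    """
    rng = np.random.default_rng(seed)
    point = run_fqe(episodes)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(episodes), size=len(episodes))   # resample whole trajectories
        estimates.append(run_fqe([episodes[i] for i in idx]))
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return point, (lo, hi)
```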
arXiv Detail & Related papers (2021-02-06T16:45:33Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z) - CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z) - Targeting Learning: Robust Statistics for Reproducible Research [1.1455937444848387]
Targeted Learning is a subfield of statistics that unifies advances in causal inference, machine learning and statistical theory to help answer scientifically impactful questions with statistical confidence.
The roadmap of Targeted Learning emphasizes tailoring statistical procedures so as to minimize their assumptions, carefully grounding them only in the scientific knowledge available.
arXiv Detail & Related papers (2020-06-12T17:17:01Z) - SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework by a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.