Matrix Estimation for Offline Reinforcement Learning with Low-Rank
Structure
- URL: http://arxiv.org/abs/2305.15621v1
- Date: Wed, 24 May 2023 23:49:06 GMT
- Title: Matrix Estimation for Offline Reinforcement Learning with Low-Rank
Structure
- Authors: Xumei Xi, Christina Lee Yu, Yudong Chen
- Abstract summary: We consider offline Reinforcement Learning (RL), where the agent does not interact with the environment and must rely on offline data collected using a behavior policy.
Previous works provide policy evaluation guarantees when the target policy to be evaluated is covered by the behavior policy.
We propose an offline policy evaluation algorithm that leverages the low-rank structure to estimate the values of uncovered state-action pairs.
- Score: 10.968373699696455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider offline Reinforcement Learning (RL), where the agent does not
interact with the environment and must rely on offline data collected using a
behavior policy. Previous works provide policy evaluation guarantees when the
target policy to be evaluated is covered by the behavior policy, that is,
state-action pairs visited by the target policy must also be visited by the
behavior policy. We show that when the MDP has a latent low-rank structure,
this coverage condition can be relaxed. Building on the connection to weighted
matrix completion with non-uniform observations, we propose an offline policy
evaluation algorithm that leverages the low-rank structure to estimate the
values of uncovered state-action pairs. Our algorithm does not require a known
feature representation, and our finite-sample error bound involves a novel
discrepancy measure quantifying the discrepancy between the behavior and target
policies in the spectral space. We provide concrete examples where our
algorithm achieves accurate estimation while existing coverage conditions are
not satisfied. Building on the above evaluation algorithm, we further design an
offline policy optimization algorithm and provide non-asymptotic performance
guarantees.
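As a rough illustration of the matrix-estimation viewpoint (a minimal sketch, not the paper's algorithm), the Python snippet below fills in Q-values for state-action pairs the behavior policy never visits by repeatedly projecting onto low-rank matrices while keeping the observed entries fixed. The rank, iteration count, and synthetic data are placeholder assumptions, and the sketch ignores the non-uniform observation weighting that the paper's weighted-matrix-completion analysis handles.

```python
import numpy as np

def complete_q_matrix(Q_obs, mask, rank=3, n_iters=200):
    """Fill unobserved (state, action) entries of a Q-value matrix by
    iterative low-rank projection (a hard-impute heuristic, not the
    paper's estimator).

    Q_obs : (S, A) array; only entries where mask is True are used.
    mask  : (S, A) boolean array, True where the behavior policy provided data.
    rank  : assumed latent rank of the Q-matrix.
    """
    Q_hat = np.where(mask, Q_obs, 0.0)            # initialize missing entries at 0
    for _ in range(n_iters):
        # Project the current estimate onto the set of rank-`rank` matrices.
        U, s, Vt = np.linalg.svd(Q_hat, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep observed entries fixed; update only the uncovered ones.
        Q_hat = np.where(mask, Q_obs, low_rank)
    return Q_hat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, true_rank = 30, 10, 2
    Q_true = rng.normal(size=(S, true_rank)) @ rng.normal(size=(true_rank, A))
    mask = rng.random((S, A)) < 0.4               # behavior policy covers ~40% of pairs
    Q_est = complete_q_matrix(Q_true, mask, rank=true_rank)
    err = np.abs(Q_est - Q_true)[~mask].mean()
    print(f"mean abs error on uncovered pairs: {err:.3f}")
```

The point of the toy run is only that low-rank structure makes uncovered entries recoverable at all; the paper's estimator additionally accounts for how the behavior policy skews which entries are observed and quantifies the resulting error through a spectral discrepancy measure.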
Related papers
- Iteratively Refined Behavior Regularization for Offline Reinforcement
Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Hallucinated Adversarial Control for Conservative Offline Policy
Evaluation [64.94009515033984]
We study the problem of conservative off-policy evaluation (COPE) where given an offline dataset of environment interactions, we seek to obtain a (tight) lower bound on a policy's performance.
We introduce HAMBO, which builds on an uncertainty-aware learned model of the transition dynamics.
We prove that the resulting COPE estimates are valid lower bounds, and, under regularity conditions, show their convergence to the true expected return.
arXiv Detail & Related papers (2023-03-02T08:57:35Z)
- Offline Policy Evaluation and Optimization under Confounding [35.778917456294046]
We map out the landscape of offline policy evaluation for confounded MDPs.
We characterize settings where consistent value estimates are provably not achievable.
We present new algorithms for offline policy improvement and prove local convergence guarantees.
arXiv Detail & Related papers (2022-11-29T20:45:08Z)
- A Sharp Characterization of Linear Estimators for Offline Policy
Evaluation [33.37672297925897]
Offline policy evaluation is a fundamental statistical problem in reinforcement learning.
We identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods to succeed.
Our results provide a complete picture of the behavior of linear estimators for offline policy evaluation.
arXiv Detail & Related papers (2022-03-08T17:52:57Z)
- Non-asymptotic Confidence Intervals of Off-policy Evaluation: Primal and
Dual Bounds [21.520045697447372]
Off-policy evaluation (OPE) is the task of estimating the expected reward of a given policy based on offline data previously collected under different policies.
This work considers the problem of constructing non-asymptotic confidence intervals in infinite-horizon off-policy evaluation.
We develop a practical algorithm through a primal-dual optimization-based approach.
arXiv Detail & Related papers (2021-03-09T22:31:20Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or more logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic
Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist (a sketch of the standard discrete-action doubly robust baseline appears after this list).
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement
Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in batch reinforcement learning settings such as education and healthcare.
We develop an approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
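Several of the related papers above revolve around off-policy evaluation from logged data. For orientation, here is a minimal sketch of the standard doubly robust estimator for discrete actions in a one-step (contextual bandit) setting, the baseline referenced in the deterministic continuous-action paper above; the data layout, the uniform logging policy, and the plug-in reward model are all placeholder assumptions, not any paper's implementation.

```python
import numpy as np

def doubly_robust_value(contexts, actions, rewards, behavior_probs,
                        target_probs, q_model):
    """Standard doubly robust off-policy value estimate for discrete actions
    in a one-step (contextual bandit) setting -- an illustrative baseline.

    contexts       : (n, d) array of observed contexts/states.
    actions        : (n,) integer array of logged actions.
    rewards        : (n,) array of observed rewards.
    behavior_probs : (n,) probability the behavior policy gave each logged action.
    target_probs   : (n, A) target-policy action probabilities per context.
    q_model        : callable contexts -> (n, A) estimated rewards per action.
    """
    n = len(rewards)
    q_hat = q_model(contexts)                             # (n, A) reward model
    # Direct-method term: expected model reward under the target policy.
    dm = np.sum(target_probs * q_hat, axis=1)
    # Importance-weighted correction on the logged actions.
    idx = np.arange(n)
    rho = target_probs[idx, actions] / behavior_probs     # density ratio
    correction = rho * (rewards - q_hat[idx, actions])
    return np.mean(dm + correction)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, d, A = 5000, 4, 3
    X = rng.normal(size=(n, d))
    W = rng.normal(size=(d, A))                           # hidden reward weights
    behavior = np.full((n, A), 1.0 / A)                   # uniform logging policy
    acts = rng.integers(A, size=n)
    rewards = (X @ W)[np.arange(n), acts] + 0.1 * rng.normal(size=n)
    target = np.eye(A)[np.argmax(X @ W, axis=1)]          # greedy target policy
    q_model = lambda C: (C @ W) * 0.9                     # deliberately imperfect model
    v_hat = doubly_robust_value(X, acts, rewards,
                                behavior[np.arange(n), acts], target, q_model)
    print(f"DR estimate of target-policy value: {v_hat:.3f}")
```

When the target policy is deterministic over continuous actions, the density ratio `rho` in this sketch is no longer defined, which is precisely the failure mode that the kernelized estimators mentioned above are designed to address.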