Scaling Marginalized Importance Sampling to High-Dimensional
State-Spaces via State Abstraction
- URL: http://arxiv.org/abs/2212.07486v1
- Date: Wed, 14 Dec 2022 20:07:33 GMT
- Title: Scaling Marginalized Importance Sampling to High-Dimensional
State-Spaces via State Abstraction
- Authors: Brahma S. Pavse and Josiah P. Hanna
- Abstract summary: We consider the problem of off-policy evaluation in reinforcement learning (RL).
We propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space.
- Score: 5.150752343250592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of off-policy evaluation (OPE) in reinforcement
learning (RL), where the goal is to estimate the performance of an evaluation
policy, $\pi_e$, using a fixed dataset, $\mathcal{D}$, collected by one or more
policies that may be different from $\pi_e$. Current OPE algorithms may produce
poor OPE estimates under policy distribution shift, i.e., when the probability
of a particular state-action pair occurring under $\pi_e$ is very different
from the probability of that same pair occurring in $\mathcal{D}$ (Voloshin et
al. 2021, Fu et al. 2021). In this work, we propose to improve the accuracy of
OPE estimators by projecting the high-dimensional state-space into a
low-dimensional state-space using concepts from the state abstraction
literature. Specifically, we consider marginalized importance sampling (MIS)
OPE algorithms which compute state-action distribution correction ratios to
produce their OPE estimate. In the original ground state-space, these ratios
may have high variance, which may lead to high-variance OPE. However, we prove
that in the lower-dimensional abstract state-space the ratios can have lower
variance, resulting in lower-variance OPE. We then highlight the challenges that
arise when estimating the abstract ratios from data, identify sufficient
conditions to overcome these issues, and present a minimax optimization problem
whose solution yields these abstract ratios. Finally, our empirical evaluation
on difficult, high-dimensional state-space OPE tasks shows that the abstract
ratios can make MIS OPE estimators achieve lower mean-squared error and be more
robust to hyperparameter tuning than the ground ratios.
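For concreteness, below is a minimal sketch (in Python) of how an MIS estimate could be formed once abstract ratios are in hand: each ground state is passed through the abstraction before looking up its ratio, and the ratios reweight the observed rewards. The names (`abstract_mis_estimate`, `phi`, `w_abs`) and the self-normalized, discounted form are illustrative assumptions; this is not the paper's exact estimator, and the minimax procedure for learning the abstract ratios is not reproduced here.
```python
import numpy as np

def abstract_mis_estimate(transitions, phi, w_abs, gamma=0.99):
    """Self-normalized MIS OPE estimate computed with abstract ratios.

    transitions : list of (state, action, reward) tuples from the behavior data D
    phi         : state abstraction mapping a ground state to an abstract state
    w_abs       : callable (abstract_state, action) -> estimated ratio
                  d^{pi_e}(phi(s), a) / d^{D}(phi(s), a)
    """
    weights = np.array([w_abs(phi(s), a) for s, a, _ in transitions])
    rewards = np.array([r for _, _, r in transitions])
    # Reweight the observed rewards by the abstract correction ratios,
    # self-normalize the weights, and rescale by 1/(1 - gamma) to obtain
    # an estimate of the discounted return of pi_e.
    return float((weights * rewards).sum() / weights.sum()) / (1.0 - gamma)
```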
Related papers
- Rejection via Learning Density Ratios [50.91522897152437]
Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions.
We propose a different distributional perspective, where we seek to find an idealized data distribution which maximizes a pretrained model's performance.
Our framework is tested empirically over clean and noisy datasets.
arXiv Detail & Related papers (2024-05-29T01:32:17Z) - A Finite-Horizon Approach to Active Level Set Estimation [0.7366405857677227]
We consider the problem of active learning in the context of spatial sampling for level set estimation (LSE).
We present a finite-horizon search procedure to perform LSE in one dimension while optimally balancing both the final estimation error and the distance traveled for a fixed number of samples.
We show that the resulting optimization problem can be solved in closed form and that the resulting policy generalizes existing approaches to this problem.
arXiv Detail & Related papers (2023-10-18T14:11:41Z) - Nearly Optimal Latent State Decoding in Block MDPs [74.51224067640717]
In episodic Block MDPs, the decision maker has access to rich observations or contexts generated from a small number of latent states.
We are first interested in estimating the latent state decoding function based on data generated under a fixed behavior policy.
We then study the problem of learning near-optimal policies in the reward-free framework.
arXiv Detail & Related papers (2022-08-17T18:49:53Z) - Sample Complexity of Nonparametric Off-Policy Evaluation on
Low-Dimensional Manifolds using Deep Networks [71.95722100511627]
We consider the off-policy evaluation problem of reinforcement learning using deep neural networks.
We show that, by choosing network size appropriately, one can leverage the low-dimensional manifold structure in the Markov decision process.
arXiv Detail & Related papers (2022-06-06T20:25:20Z) - Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy distribution shift between the data-collection and deployment environments, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z) - Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium
Learning from Offline Datasets [101.5329678997916]
We study episodic two-player zero-sum Markov games (MGs) in the offline setting.
The goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori.
arXiv Detail & Related papers (2022-02-15T15:39:30Z) - SOPE: Spectrum of Off-Policy Estimators [40.15700429288981]
We show the existence of a spectrum of estimators whose endpoints are SIS and IS.
We provide empirical evidence that estimators in this spectrum can be used to trade-off between the bias and variance of IS and SIS.
arXiv Detail & Related papers (2021-11-06T18:29:21Z) - Measuring Model Fairness under Noisy Covariates: A Theoretical
Perspective [26.704446184314506]
We study the problem of measuring the fairness of a machine learning model under noisy information.
We present a theoretical analysis that aims to characterize weaker conditions under which accurate fairness evaluation is possible.
arXiv Detail & Related papers (2021-05-20T18:36:28Z) - Provably Good Batch Reinforcement Learning Without Great Exploration [51.51462608429621]
Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks.
Recent algorithms have shown promise but can still be overly optimistic in their expected outcomes.
We show that a small modification to the Bellman optimality and evaluation back-ups, which takes a more conservative update, can have much stronger guarantees.
arXiv Detail & Related papers (2020-07-16T09:25:54Z)