Offline Reinforcement Learning with Instrumental Variables in Confounded
Markov Decision Processes
- URL: http://arxiv.org/abs/2209.08666v1
- Date: Sun, 18 Sep 2022 22:03:55 GMT
- Title: Offline Reinforcement Learning with Instrumental Variables in Confounded
Markov Decision Processes
- Authors: Zuyue Fu, Zhengling Qi, Zhaoran Wang, Zhuoran Yang, Yanxun Xu, Michael
R. Kosorok
- Abstract summary: We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose various policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
- Score: 93.61202366677526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study offline reinforcement learning (RL) in the face of unmeasured
confounders. Due to the lack of online interaction with the environment,
offline RL faces two significant challenges: (i) the agent may be confounded
by unobserved state variables; (ii) the offline data collected a priori do not
provide sufficient coverage of the environment. To tackle these challenges, we
study policy learning in confounded MDPs with the aid of instrumental
variables. Specifically, we first establish value function (VF)-based and
marginalized importance sampling (MIS)-based identification results for the
expected total reward in confounded MDPs. Then, by leveraging pessimism and
our identification results, we propose various policy learning methods with
finite-sample suboptimality guarantees for finding the optimal in-class policy
under minimal data coverage and modeling assumptions. Lastly, our extensive
theoretical investigations and a numerical study motivated by kidney
transplantation demonstrate the promising performance of the proposed methods.
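To ground the identification idea: in the classical single-stage instrumental-variable model, with outcome Y, action A, observed state X, and an instrument Z that shifts the action but affects the outcome only through it, the structural function f is characterized by a conditional moment restriction; pessimism then selects the policy maximizing a lower confidence bound on its estimated value. The display below is a minimal single-stage sketch of these two ingredients, not the paper's sequential conditions or estimators.

    % Single-stage IV conditional moment restriction:
    \mathbb{E}\bigl[\, Y - f(A, X) \mid Z, X \,\bigr] = 0
    % Pessimistic policy selection: \Gamma(\pi) quantifies the
    % estimation uncertainty of the value estimate \hat{J}(\pi).
    \hat{\pi} = \operatorname*{arg\,max}_{\pi \in \Pi} \bigl\{ \hat{J}(\pi) - \Gamma(\pi) \bigr\}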
Related papers
- Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms [34.593772931446125]
This monograph explores various model-based and model-free approaches for constrained reinforcement learning within the context of average-reward Markov Decision Processes (MDPs).
The primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs.
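For context, primal-dual methods for constrained MDPs optimize a Lagrangian, ascending in the policy and descending in the multiplier; a generic single-constraint form (not necessarily the monograph's exact algorithm) is:

    % Maximize average reward J_r subject to average cost J_c <= b:
    L(\pi, \lambda) = J_r(\pi) - \lambda \bigl( J_c(\pi) - b \bigr)
    % Dual update: projected subgradient step with rate \eta,
    % keeping the multiplier non-negative.
    \lambda \leftarrow \bigl[ \lambda + \eta \bigl( J_c(\pi) - b \bigr) \bigr]_{+}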
arXiv Detail & Related papers (2024-06-17T12:46:02Z)
- SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets [32.496818080222646]
We propose a new approach to model-based offline reinforcement learning.
We provide a theoretical guarantee on the model uncertainty and a performance bound for SeMOPO.
Experimental results show that our method substantially outperforms all baseline methods.
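SeMOPO sits in the family of model-based offline methods that penalize the learned reward by an estimate of model uncertainty. The sketch below shows the generic family recipe with ensemble disagreement as the uncertainty proxy; this is an illustrative assumption, not SeMOPO's actual estimator.

    import numpy as np

    def pessimistic_reward(next_state_preds, reward_preds, lam=1.0):
        # next_state_preds: (n_models, state_dim) ensemble next-state predictions
        # reward_preds: (n_models,) ensemble reward predictions
        u = next_state_preds.std(axis=0).max()   # disagreement as uncertainty proxy
        return reward_preds.mean() - lam * u     # pessimistic (penalized) reward

    # Toy usage: five ensemble members disagree mildly about the next state.
    preds = np.random.randn(5, 3) * 0.1 + 1.0
    rews = np.array([0.9, 1.1, 1.0, 0.95, 1.05])
    print(pessimistic_reward(preds, rews))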
arXiv Detail & Related papers (2024-06-13T15:16:38Z)
- Pessimistic Causal Reinforcement Learning with Mediators for Confounded Offline Data [17.991833729722288]
We propose a novel policy learning algorithm, PESsimistic CAusal Learning (PESCAL).
Our key observation is that, by incorporating auxiliary variables that mediate the effect of actions on system dynamics, it is sufficient to learn a lower bound of the mediator distribution function, instead of the Q-function.
We provide theoretical guarantees for the algorithms we propose, and demonstrate their efficacy through simulations, as well as real-world experiments utilizing offline datasets from a leading ride-hailing platform.
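Mediator-based identification is rooted in Pearl's front-door adjustment: if a mediator M fully transmits the effect of action A on outcome Y and is unconfounded given A, the interventional distribution is identified despite an unmeasured A-Y confounder. The single-stage formula below is the classical version; the paper works with its sequential analogue.

    P\bigl( y \mid \mathrm{do}(a) \bigr)
      = \sum_{m} P(m \mid a) \sum_{a'} P(y \mid m, a')\, P(a')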
arXiv Detail & Related papers (2024-03-18T14:51:19Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
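Double policy estimation builds on classical importance-weighted off-policy evaluation; a standard reference point is the doubly robust estimator, computed by the backward recursion below with fitted models \hat{V}, \hat{Q} and the per-step ratio between evaluation policy \pi_e and behavior policy \pi_b. This is only the well-known DR backbone; DPE's contribution is estimating both policies inside a sequence-modeling objective.

    \rho_t = \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)}, \qquad V_{\mathrm{DR}}(T + 1) = 0,
    V_{\mathrm{DR}}(t) = \hat{V}(s_t) + \rho_t \bigl( r_t + \gamma\, V_{\mathrm{DR}}(t + 1) - \hat{Q}(s_t, a_t) \bigr)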
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Uncertainty Estimation by Fisher Information-based Evidential Deep Learning [61.94125052118442]
Uncertainty estimation is a key factor that makes deep learning reliable in practical applications.
We propose a novel method, Fisher Information-based Evidential Deep Learning ($\mathcal{I}$-EDL).
In particular, we introduce Fisher Information Matrix (FIM) to measure the informativeness of evidence carried by each sample, according to which we can dynamically reweight the objective loss terms to make the network more focused on the representation learning of uncertain classes.
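As background, evidential deep learning maps network outputs to non-negative per-class evidence and reads uncertainty off the induced Dirichlet distribution; $\mathcal{I}$-EDL adds the FIM-based loss reweighting on top. A minimal sketch of the standard evidence-to-uncertainty mapping (the reweighting itself is omitted):

    import numpy as np

    def dirichlet_uncertainty(evidence):
        # Map non-negative per-class evidence to Dirichlet-based uncertainty.
        alpha = evidence + 1.0              # Dirichlet concentration parameters
        strength = alpha.sum()              # total Dirichlet strength S
        belief = evidence / strength        # per-class belief mass b_k = e_k / S
        vacuity = evidence.size / strength  # total uncertainty u = K / S
        probs = alpha / strength            # expected class probabilities
        return belief, vacuity, probs

    # Toy usage: strong evidence for class 0 -> low vacuity.
    print(dirichlet_uncertainty(np.array([10.0, 1.0, 0.5])))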
arXiv Detail & Related papers (2023-03-03T16:12:59Z)
- Distributionally Robust Model-Based Offline Reinforcement Learning with Near-Optimal Sample Complexity [39.886149789339335]
Offline reinforcement learning aims to learn decision making from historical data without active exploration.
Due to uncertainties and variabilities of the environment, it is critical to learn a robust policy that performs well even when the deployed environment deviates from the nominal one used to collect the history dataset.
We consider a distributionally robust formulation of offline RL, focusing on robust Markov decision processes with an uncertainty set specified by the Kullback-Leibler divergence in both finite-horizon and infinite-horizon settings.
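A useful computational fact behind KL uncertainty sets is strong duality: the inner worst-case expectation over the KL ball of radius \sigma around the nominal kernel P^0 reduces to a scalar optimization that can be evaluated from samples of P^0, which is what makes robust Bellman backups tractable:

    \inf_{P \,:\, D_{\mathrm{KL}}(P \,\|\, P^0) \le \sigma} \mathbb{E}_{P}[V]
      = \sup_{\lambda \ge 0} \Bigl\{ -\lambda \log \mathbb{E}_{P^0}\bigl[ e^{-V/\lambda} \bigr] - \lambda \sigma \Bigr\}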
arXiv Detail & Related papers (2022-08-11T11:55:31Z)
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
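Schematically, pessimistic Q-learning subtracts a confidence bonus b_n from the usual stochastic target so that estimated values err on the low side; for the n-th visit to (s, a) with step size \eta_n:

    Q(s, a) \leftarrow (1 - \eta_n)\, Q(s, a) + \eta_n \bigl( r + \hat{V}(s') - b_n \bigr),
    \qquad \hat{V}(s) = \max_{a} Q(s, a)

The variance-reduced variant sharpens the bonus via a reference-advantage decomposition; the display above is only the generic lower-confidence-bound template, not the paper's exact algorithm.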
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Reinforcement Learning with Heterogeneous Data: Estimation and Inference [84.72174994749305]
We introduce the K-Heterogeneous Markov Decision Process (K-Hetero MDP) to address sequential decision problems with population heterogeneity.
We propose the Auto-Clustered Policy Evaluation (ACPE) for estimating the value of a given policy, and the Auto-Clustered Policy Iteration (ACPI) for estimating the optimal policy in a given policy class.
We present simulations to support our theoretical findings, and we conduct an empirical study on the standard MIMIC-III dataset.
arXiv Detail & Related papers (2022-01-31T20:58:47Z)
- Proximal Learning for Individualized Treatment Regimes Under Unmeasured Confounding [3.020737957610002]
We develop approaches to estimating optimal individualized treatment regimes (ITRs) in the presence of unmeasured confounding.
Based on these results, we propose several classification-based approaches to finding a variety of restricted in-class optimal ITRs.
arXiv Detail & Related papers (2021-05-03T21:49:49Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
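Concretely, pessimistic value iteration penalizes the empirical Bellman backup: with \hat{\mathbb{B}}_h the estimated Bellman operator at step h and \Gamma_h an uncertainty quantifier bounding its error, the backward recursion (truncated to the feasible value range) reads:

    \hat{Q}_h(s, a) = \min\Bigl\{ \bigl( \hat{\mathbb{B}}_h \hat{V}_{h+1} \bigr)(s, a) - \Gamma_h(s, a),\; H - h + 1 \Bigr\}_{+},
    \qquad \hat{V}_h(s) = \max_{a} \hat{Q}_h(s, a)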
arXiv Detail & Related papers (2020-12-30T09:06:57Z)