Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders
- URL: http://arxiv.org/abs/2007.13893v1
- Date: Mon, 27 Jul 2020 22:19:01 GMT
- Title: Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with
Latent Confounders
- Authors: Andrew Bennett, Nathan Kallus, Lihong Li, Ali Mousavi
- Abstract summary: We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
- Score: 62.54431888432302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy evaluation (OPE) in reinforcement learning is an important problem
in settings where experimentation is limited, such as education and healthcare.
But, in these very same settings, observed actions are often confounded by
unobserved variables making OPE even more difficult. We study an OPE problem in
an infinite-horizon, ergodic Markov decision process with unobserved
confounders, where states and actions can act as proxies for the unobserved
confounders. We show how, given only a latent variable model for states and
actions, policy value can be identified from off-policy data. Our method
involves two stages. In the first, we show how to use proxies to estimate
stationary distribution ratios, extending recent work on breaking the curse of
horizon to the confounded setting. In the second, we show optimal balancing can
be combined with such learned ratios to obtain policy value while avoiding
direct modeling of reward functions. We establish theoretical guarantees of
consistency, and benchmark our method empirically.
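A hedged illustration of the two-stage template the abstract describes: the sketch below works in a toy, unconfounded tabular MDP, computes the stationary state-distribution ratios exactly from the model (rather than identifying them from proxies via a latent variable model, as the paper does), and simply reweights logged rewards by those ratios; the optimal-balancing stage is omitted. All names in the snippet (P, R, pi_b, pi_e, stationary_dist) are hypothetical and not taken from the paper.
```python
# Minimal sketch of stationary-distribution-ratio OPE in a toy,
# UNCONFOUNDED tabular MDP. This is not the paper's estimator; it only
# illustrates the generic two-stage weighting idea from the abstract.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Toy ergodic dynamics and mean rewards (all hypothetical).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=(n_states, n_actions))                       # E[r | s, a]

# Behavior (logging) and evaluation policies, pi[s, a].
pi_b = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_e = rng.dirichlet(np.ones(n_actions), size=n_states)


def stationary_dist(pi):
    """Stationary state distribution of the chain induced by policy pi."""
    P_pi = np.einsum("sa,sax->sx", pi, P)  # marginalize out actions
    evals, evecs = np.linalg.eig(P_pi.T)
    d = np.real(evecs[:, np.argmax(np.real(evals))])
    return d / d.sum()


# Stage 1 (idealized): stationary state-distribution ratios
# w(s) = d_{pi_e}(s) / d_{pi_b}(s). The paper identifies these from
# off-policy data via proxies; here we compute them from the known model.
w = stationary_dist(pi_e) / stationary_dist(pi_b)

# Generate off-policy data by rolling out the behavior policy.
n_steps, s = 20_000, 0
states, actions, rewards = [], [], []
for _ in range(n_steps):
    a = rng.choice(n_actions, p=pi_b[s])
    states.append(s)
    actions.append(a)
    rewards.append(R[s, a] + 0.1 * rng.standard_normal())
    s = rng.choice(n_states, p=P[s, a])
states, actions, rewards = map(np.array, (states, actions, rewards))

# Stage 2 (simplified): reweight logged rewards by the stationary ratio
# times the per-step policy ratio (no optimal balancing here).
weights = w[states] * pi_e[states, actions] / pi_b[states, actions]
v_hat = np.mean(weights * rewards)

# Ground-truth long-run average reward of pi_e, for reference.
v_true = stationary_dist(pi_e) @ np.einsum("sa,sa->s", pi_e, R)
print(f"ratio-weighted estimate: {v_hat:.3f}   true value: {v_true:.3f}")
```
The identity being exercised is E_{d_b}[w(s) · (pi_e(a|s)/pi_b(a|s)) · r] = sum_s d_{pi_e}(s) sum_a pi_e(a|s) R(s,a), the long-run average reward of the evaluation policy; the paper's contribution is making the corresponding weights identifiable when the logged actions are confounded by latent variables.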
Related papers
- Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
arXiv Detail & Related papers (2022-11-04T18:00:02Z)
- Online Learning with Off-Policy Feedback [18.861989132159945]
We study the problem of online learning in adversarial bandit problems under a partial observability model called off-policy feedback.
We propose a set of algorithms that guarantee regret bounds that scale with a natural notion of mismatch between any comparator policy and the behavior policy.
arXiv Detail & Related papers (2022-07-18T21:57:16Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency [61.03922379081648]
We propose an off-policy sample efficient approach that requires no adversarial training or min-max optimization.
Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extension approaches of adversarial imitation.
arXiv Detail & Related papers (2021-12-11T19:36:19Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- A Spectral Approach to Off-Policy Evaluation for POMDPs [8.613667867961034]
We consider off-policy evaluation in Partially Observable Markov Decision Processes.
Prior work on this problem uses a causal identification strategy based on one-step observable proxies of the hidden state.
In this work, we relax this requirement by using spectral methods and extending one-step proxies both into the past and future.
arXiv Detail & Related papers (2021-09-22T03:36:51Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with state-of-the-art methods on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z)
- Confounding-Robust Policy Evaluation in Infinite-Horizon Reinforcement Learning [70.01650994156797]
Off-policy evaluation of sequential decision policies from observational data is necessary in applications of batch reinforcement learning such as education and healthcare.
We develop a robust approach that estimates bounds on the value of a given policy.
We prove convergence to the sharp bounds as we collect more confounded data.
arXiv Detail & Related papers (2020-02-11T16:18:14Z)