On Instrumental Variable Regression for Deep Offline Policy Evaluation
- URL: http://arxiv.org/abs/2105.10148v1
- Date: Fri, 21 May 2021 06:22:34 GMT
- Title: On Instrumental Variable Regression for Deep Offline Policy Evaluation
- Authors: Yutian Chen, Liyuan Xu, Caglar Gulcehre, Tom Le Paine, Arthur Gretton,
Nando de Freitas, Arnaud Doucet
- Abstract summary: We show that the popular reinforcement learning strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding.
We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding.
This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation.
- Score: 37.05492059049681
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We show that the popular reinforcement learning (RL) strategy of estimating
the state-action value (Q-function) by minimizing the mean squared Bellman
error leads to a regression problem with confounding, the inputs and output
noise being correlated. Hence, direct minimization of the Bellman error can
result in significantly biased Q-function estimates. We explain why fixing the
target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of
overcoming this confounding, thus shedding new light on this popular but not
well understood trick in the deep RL literature. An alternative approach to
address confounding is to leverage techniques developed in the causality
literature, notably instrumental variables (IV). We bring together here the
literature on IV and RL by investigating whether IV approaches can lead to
improved Q-function estimates. This paper analyzes and compares a wide range of
recent IV methods in the context of offline policy evaluation (OPE), where the
goal is to estimate the value of a policy using logged data only. By applying
different IV techniques to OPE, we are not only able to recover previously
proposed OPE methods such as model-based techniques but also to obtain
competitive new techniques. We find empirically that state-of-the-art OPE
methods are closely matched in performance by some IV methods such as AGMM,
which were not developed for OPE. We open-source all our code and datasets at
https://github.com/liyuan9988/IVOPEwithACME.
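To make the two ideas above concrete, here is a minimal sketch (not the authors' implementation, which lives at the repository linked above) contrasting three linear estimators of a policy's value on a small synthetic Markov reward process: naive Bellman-residual minimization, which suffers from the confounding described in the abstract; Fitted Q Evaluation with a fixed target; and a two-stage-least-squares (2SLS) IV estimator that uses the current state's features as instruments. The chain, the one-hot features, the sample size, and all variable names are illustrative assumptions.
```python
# Minimal illustrative sketch (NOT the paper's code): policy evaluation on a small
# synthetic Markov reward process with tabular (one-hot) features, comparing
#   1) naive Bellman-residual minimisation (confounded: the next-state features share
#      the transition noise with the regression residual),
#   2) Fitted Q Evaluation with a fixed target (iterated standard regression),
#   3) a 2SLS-style IV estimator with current-state features as instruments
#      (just-identified here, so it reduces to the LSTD normal equations).
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, n_samples = 5, 0.9, 20_000

# Random Markov reward process: the evaluation policy is already folded into P and R.
P = rng.dirichlet(np.ones(n_states), size=n_states)        # P[s, s'] transition probabilities
R = rng.normal(size=n_states)                               # expected reward per state
v_true = np.linalg.solve(np.eye(n_states) - gamma * P, R)   # ground-truth values

# Logged transitions (s, r, s') with noisy rewards; one-hot feature matrices.
s = rng.integers(n_states, size=n_samples)
s_next = np.array([rng.choice(n_states, p=P[i]) for i in s])
r = R[s] + rng.normal(scale=1.0, size=n_samples)
Phi, Phi_next = np.eye(n_states)[s], np.eye(n_states)[s_next]

# 1) Bellman-residual minimisation: regress r on (Phi - gamma * Phi_next).
X = Phi - gamma * Phi_next
w_brm = np.linalg.lstsq(X, r, rcond=None)[0]

# 2) Fitted Q Evaluation: repeatedly regress onto a backup computed with a *frozen*
#    weight vector, so each step is an ordinary, unconfounded regression.
w_fqe = np.zeros(n_states)
for _ in range(200):
    target = r + gamma * Phi_next @ w_fqe      # fixed target; does not move within the fit
    w_fqe = np.linalg.lstsq(Phi, target, rcond=None)[0]

# 3) IV / 2SLS with instrument Phi: solve Phi^T (Phi - gamma * Phi_next) w = Phi^T r.
w_iv = np.linalg.solve(Phi.T @ X, Phi.T @ r)

for name, w in [("true values", v_true), ("BRM (biased)", w_brm),
                ("FQE", w_fqe), ("IV / 2SLS", w_iv)]:
    print(f"{name:13s}", np.round(w, 3))
```
With tabular features the problem is just-identified, so the 2SLS estimate coincides with LSTD and, like FQE, is consistent, while the Bellman-residual estimate stays systematically biased whenever transitions are stochastic.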
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Learning Decision Policies with Instrumental Variables through Double Machine Learning [16.842233444365764]
A common issue in learning decision-making policies in data-rich settings is spurious correlations in the offline dataset.
We propose DML-IV, a non-linear IV regression method that reduces the bias in two-stage IV regressions.
It outperforms state-of-the-art IV regression methods on IV regression benchmarks and learns high-performing policies in the presence of instruments.
arXiv Detail & Related papers (2024-05-14T10:55:04Z)
- Regularized DeepIV with Model Selection [72.17508967124081]
Regularized DeepIV (RDIV) regression can converge to the least-norm IV solution.
Our method matches the current state-of-the-art convergence rate.
arXiv Detail & Related papers (2024-03-07T05:38:56Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to track the evolving properties of the Q-network during training.
For the first time, our theory can reliably predict at an early stage whether training will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble [16.92791301062903]
We propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution.
Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks simply by increasing the number of Q-networks used with clipped Q-learning (a minimal sketch of this clipped ensemble backup appears after this list).
arXiv Detail & Related papers (2021-10-04T16:40:13Z)
- Scalable Quasi-Bayesian Inference for Instrumental Variable Regression [40.33643110066981]
We present a scalable quasi-Bayesian procedure for IV regression, building upon the recently developed kernelized IV models.
Our approach does not require additional assumptions on the data generating process, and leads to a scalable approximate inference algorithm with time cost comparable to the corresponding point estimation methods.
arXiv Detail & Related papers (2021-06-16T12:52:19Z)
- On Finite-Sample Analysis of Offline Reinforcement Learning with Deep ReLU Networks [46.067702683141356]
We study the statistical theory of offline reinforcement learning with deep ReLU networks.
We quantify how the distribution shift of the offline data, the dimension of the input space, and the regularity of the system control the OPE estimation error.
arXiv Detail & Related papers (2021-03-11T14:01:14Z)
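As referenced in the Diversified Q-Ensemble entry above, the "clipped" ensemble backup is simple to state: the bootstrap target takes the minimum over N critics and becomes increasingly conservative as N grows. The sketch below is a hypothetical illustration with randomly initialised linear critics and synthetic data, not the cited paper's implementation; all names and sizes are assumptions.
```python
# Hypothetical sketch of the clipped ensemble backup: target = r + gamma * min_i Q_i(s', a').
# Randomly initialised linear critics and a synthetic batch stand in for trained networks,
# purely to show that the min over a larger ensemble yields a more conservative target.
import numpy as np

rng = np.random.default_rng(1)
feat_dim, gamma, batch = 6, 0.99, 256            # (state, action) feature size, discount, batch size

def clipped_target(r, sa_next, critic_weights):
    """Backup r + gamma * min over critics of Q_i(s', a') for a batch of transitions."""
    q_next = np.stack([sa_next @ w for w in critic_weights])   # shape (N, batch)
    return r + gamma * q_next.min(axis=0)

r = rng.normal(size=batch)                        # rewards
sa_next = rng.normal(size=(batch, feat_dim))      # features of (s', a') with a' from the policy

for n_critics in (2, 5, 10, 50):
    critics = [rng.normal(scale=0.5, size=feat_dim) for _ in range(n_critics)]
    print(f"{n_critics:2d} critics -> mean clipped target: "
          f"{clipped_target(r, sa_next, critics).mean():+.3f}")
```
In the cited method each critic is trained against this shared minimum target; the sketch omits training entirely and only illustrates how ensemble size affects the backup.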
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.