Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference
- URL: http://arxiv.org/abs/2501.06926v2
- Date: Sun, 27 Apr 2025 21:06:09 GMT
- Title: Automatic Double Reinforcement Learning in Semiparametric Markov Decision Processes with Applications to Long-Term Causal Inference
- Authors: Lars van der Laan, David Hubbard, Allen Tran, Nathan Kallus, Aurélien Bibaut
- Abstract summary: Markov Decision Processes (MDPs) offer a principled framework for modeling outcomes as sequences of states, actions, and rewards over time. We introduce a semiparametric extension of Double Reinforcement Learning (DRL) for statistically efficient, model-robust inference on linear functionals of the Q-function. We develop a novel debiased plug-in estimator based on isotonic Bellman calibration, which integrates fitted Q-iteration with an isotonic regression step.
- Score: 33.14076284663493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating long-term causal effects from short-term data is essential for decision-making in healthcare, economics, and industry, where long-term follow-up is often infeasible. Markov Decision Processes (MDPs) offer a principled framework for modeling outcomes as sequences of states, actions, and rewards over time. We introduce a semiparametric extension of Double Reinforcement Learning (DRL) for statistically efficient, model-robust inference on linear functionals of the Q-function, such as policy values, in infinite-horizon, time-homogeneous MDPs. By imposing semiparametric structure on the Q-function, our method relaxes the strong state overlap assumptions required by fully nonparametric approaches, improving efficiency and stability. To address computational and robustness challenges of minimax nuisance estimation, we develop a novel debiased plug-in estimator based on isotonic Bellman calibration, which integrates fitted Q-iteration with an isotonic regression step. This procedure leverages the Q-function as a data-driven dimension reduction, debiases all linear functionals of interest simultaneously, and enables nonparametric inference without explicit nuisance function estimation. Bellman calibration generalizes isotonic calibration to MDPs and may be of independent interest for prediction in reinforcement learning. Finally, we show that model selection for the Q-function incurs only second-order bias and extend the adaptive debiased machine learning (ADML) framework to MDPs for data-driven learning of semiparametric structure.
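The abstract sketches a two-stage construction: fitted Q-iteration followed by an isotonic regression ("Bellman calibration") step. Below is a minimal, hypothetical sketch of that idea, assuming scikit-learn-style learners, a deterministic evaluation policy, and logged transitions (s, a, r, s'); it illustrates the general recipe only, not the authors' estimator, which additionally debiases linear functionals of the Q-function.

```python
# Hypothetical illustration (not the authors' code): fitted Q-iteration
# followed by an isotonic regression step on Bellman targets.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.isotonic import IsotonicRegression

def fitted_q_iteration(s, a, r, s_next, policy, gamma=0.95, n_iters=50):
    """Estimate Q^pi for the evaluation policy via repeated Bellman regressions."""
    X = np.column_stack([s, a])
    q = GradientBoostingRegressor().fit(X, r)              # Q_0: immediate reward
    for _ in range(n_iters):
        X_next = np.column_stack([s_next, policy(s_next)]) # next action under pi
        y = r + gamma * q.predict(X_next)                   # Bellman targets
        q = GradientBoostingRegressor().fit(X, y)
    return q

def bellman_calibrate(q, s, a, r, s_next, policy, gamma=0.95):
    """Isotonic step: regress Bellman targets on Q-hat's own predictions (1-D reduction)."""
    preds = q.predict(np.column_stack([s, a]))
    targets = r + gamma * q.predict(np.column_stack([s_next, policy(s_next)]))
    iso = IsotonicRegression(out_of_bounds="clip").fit(preds, targets)
    # Calibrated Q: compose the isotonic map with the fitted Q-function.
    return lambda s_, a_: iso.predict(q.predict(np.column_stack([s_, a_])))
```

A plug-in policy-value estimate would then average the calibrated Q-values over initial states and target-policy actions; the paper's debiasing and inference machinery goes beyond this sketch.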
Related papers
- Automatic Debiased Machine Learning for Smooth Functionals of Nonparametric M-Estimands [34.30497962430375]
We propose a unified framework for automatic debiased machine learning (autoDML) to perform inference on smooth functionals of infinite-dimensional M-estimands.
We introduce three autoDML estimators based on one-step estimation, targeted minimum loss-based estimation, and the method of sieves.
For data-driven model selection, we derive a novel decomposition of model approximation error for smooth functionals of M-estimands.
arXiv Detail & Related papers (2025-01-21T03:50:51Z)
- Semiparametric inference for impulse response functions using double/debiased machine learning [49.1574468325115]
We introduce a machine learning estimator for the impulse response function (IRF) in settings where a time series of interest is subjected to multiple discrete treatments.
The proposed estimator can rely on fully nonparametric relations between treatment and outcome variables, opening up the possibility to use flexible machine learning approaches to estimate IRFs.
arXiv Detail & Related papers (2024-11-15T07:42:02Z)
- Automatic doubly robust inference for linear functionals via calibrated debiased machine learning [0.9694940903078658]
We propose calibrated debiased machine learning (C-DML), a debiased machine learning estimator for doubly robust inference.
A C-DML estimator maintains linearity when either the outcome regression or the Riesz representer of the linear functional is estimated sufficiently well.
Our theoretical and empirical results support the use of C-DML to mitigate bias arising from the inconsistent or slow estimation of nuisance functions.
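For orientation, the generic debiased (doubly robust) estimator of a linear functional $\theta_0 = E[m(W; g_0)]$ of an outcome regression $g_0$, with Riesz representer $\alpha_0$, typically takes the form

$$\hat\theta = \frac{1}{n}\sum_{i=1}^{n}\Big\{ m(W_i; \hat g) + \hat\alpha(W_i)\,\big(Y_i - \hat g(W_i)\big)\Big\},$$

which is a standard textbook form rather than the paper's exact construction; as the title suggests, C-DML additionally calibrates the nuisance estimates $\hat g$ and $\hat\alpha$ before plugging them in.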
arXiv Detail & Related papers (2024-11-05T03:32:30Z)
- Efficient and Sharp Off-Policy Evaluation in Robust Markov Decision Processes [44.974100402600165]
We study the evaluation of a policy under best- and worst-case perturbations to a Markov decision process (MDP).
We use transition observations from the original MDP, whether they are generated under the same or a different policy.
Our estimator also permits statistical inference using Wald confidence intervals.
arXiv Detail & Related papers (2024-03-29T18:11:49Z)
- Online non-parametric likelihood-ratio estimation by Pearson-divergence functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t \sim p, x'_t \sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
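For background, a common least-squares form of the Pearson-divergence objective for likelihood-ratio estimation (possibly differing in details from the paper's online variant) is

$$\hat r \in \arg\min_{r}\; \tfrac{1}{2}\,\mathbb{E}_{x' \sim q}\big[r(x')^2\big] - \mathbb{E}_{x \sim p}\big[r(x)\big],$$

whose population minimizer is the density ratio $r^*(x) = p(x)/q(x)$.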
arXiv Detail & Related papers (2023-11-03T13:20:11Z)
- FAStEN: An Efficient Adaptive Method for Feature Selection and Estimation in High-Dimensional Functional Regressions [7.674715791336311]
We propose a new, flexible and ultra-efficient approach to perform feature selection in a sparse function-on-function regression problem.
We show how to extend it to the scalar-on-function framework.
We present an application to brain fMRI data from the AOMIC PIOP1 study.
arXiv Detail & Related papers (2023-03-26T19:41:17Z)
- Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $\tilde{O}(d\sqrt{H^3K})$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z)
- GEC: A Unified Framework for Interactive Decision Making in MDP, POMDP, and Beyond [101.5329678997916]
We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making.
We propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation.
We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR.
arXiv Detail & Related papers (2022-11-03T16:42:40Z)
- Implicitly Regularized RL with Implicit Q-Values [42.87920755961722]
The $Q$-function is a central quantity in many reinforcement learning (RL) algorithms, in which agents behave according to a (soft-)greedy policy with respect to $Q$.
We propose to parametrize the $Q$-function implicitly, as the sum of a log-policy and a value function.
We derive a practical off-policy deep RL algorithm that is suitable for large action spaces and enforces the softmax relation between the policy and the $Q$-values.
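As a rough sketch of this parametrization (with an assumed temperature $\tau$; notation may differ from the paper's), one can write

$$Q_\theta(s,a) = \tau \log \pi_\theta(a \mid s) + V_\phi(s), \qquad \pi_\theta(a \mid s) = \frac{\exp\big(Q_\theta(s,a)/\tau\big)}{\sum_{a'}\exp\big(Q_\theta(s,a')/\tau\big)},$$

so the softmax relation between the policy and the $Q$-values holds by construction.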
arXiv Detail & Related papers (2021-08-16T12:20:47Z)
- Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z)
- Exploiting Submodular Value Functions For Scaling Up Active Perception [60.81276437097671]
In active perception tasks, an agent aims to select sensory actions that reduce uncertainty about one or more hidden variables.
Partially observable Markov decision processes (POMDPs) provide a natural model for such problems.
As the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially.
arXiv Detail & Related papers (2020-09-21T09:11:36Z)
- Learning Stable Nonparametric Dynamical Systems with Gaussian Process Regression [9.126353101382607]
We learn a nonparametric Lyapunov function based on Gaussian process regression from data.
We prove that stabilization of the nominal model based on the nonparametric control Lyapunov function does not modify the behavior of the nominal model at training samples.
arXiv Detail & Related papers (2020-06-14T11:17:17Z)
- Machine learning for causal inference: on the use of cross-fit estimators [77.34726150561087]
Doubly-robust cross-fit estimators have been proposed to yield better statistical properties.
We conducted a simulation study to assess the performance of several estimators of the average causal effect (ACE).
When used with machine learning, the doubly-robust cross-fit estimators substantially outperformed all of the other estimators in terms of bias, variance, and confidence interval coverage.
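For readers unfamiliar with cross-fitting, here is a minimal, hypothetical sketch of a doubly-robust cross-fit estimator of the ACE for a binary treatment, assuming numpy arrays `X`, `A`, `Y` and scikit-learn learners; this is an illustration only, not the simulation code used in the paper.

```python
# Hypothetical sketch of a doubly-robust (AIPW) cross-fit estimator of the ACE.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

def cross_fit_aipw(X, A, Y, n_splits=2, random_state=0):
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=random_state).split(X):
        # Nuisances fit on the training folds only (cross-fitting).
        ps = LogisticRegression(max_iter=1000).fit(X[train], A[train])
        mu1 = RandomForestRegressor().fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        mu0 = RandomForestRegressor().fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        # AIPW (doubly robust) score evaluated on the held-out fold.
        psi[test] = (m1 - m0
                     + A[test] * (Y[test] - m1) / e
                     - (1 - A[test]) * (Y[test] - m0) / (1 - e))
    return psi.mean()
```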
arXiv Detail & Related papers (2020-04-21T23:09:55Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
- Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond [69.83813153444115]
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference.
Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances.
We propose localized debiased machine learning (LDML), which avoids this burdensome step.
arXiv Detail & Related papers (2019-12-30T14:42:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.