Asymptotic Inference for Multi-Stage Stationary Treatment Policy with Variable Selection
- URL: http://arxiv.org/abs/2301.12553v3
- Date: Wed, 08 Jan 2025 16:02:24 GMT
- Title: Asymptotic Inference for Multi-Stage Stationary Treatment Policy with Variable Selection
- Authors: Daiqi Gao, Yufeng Liu, Donglin Zeng
- Abstract summary: Dynamic treatment regimes or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies in practice, namely multi-stage stationary treatment policies, prescribes treatment assignment using the same decision function across stages. Although there has been extensive literature on constructing valid inference for the value function associated with dynamic treatment policies, little work has focused on the policies themselves.
- Score: 13.202945240520986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dynamic treatment regimes or policies are a sequence of decision functions over multiple stages that are tailored to individual features. One important class of treatment policies in practice, namely multi-stage stationary treatment policies, prescribes treatment assignment probabilities using the same decision function across stages, where the decision is based on the same set of features consisting of time-evolving variables (e.g., routinely collected disease biomarkers). Although there has been extensive literature on constructing valid inference for the value function associated with dynamic treatment policies, little work has focused on the policies themselves, especially in the presence of high-dimensional feature variables. We aim to fill the gap in this work. Specifically, we first estimate the multi-stage stationary treatment policy using an augmented inverse probability weighted estimator for the value function to increase asymptotic efficiency, and further apply a penalty to select important feature variables. We then construct one-step improvements of the policy parameter estimators for valid inference. Theoretically, we show that the improved estimators are asymptotically normal, even if nuisance parameters are estimated at a slow convergence rate and the dimension of the feature variables increases with the sample size. Our numerical studies demonstrate that the proposed method estimates a sparse policy with a near-optimal value function and conducts valid inference for the policy parameters.
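The abstract combines three ingredients: an augmented inverse probability weighted (AIPW) estimate of a policy's value, a penalty that selects important features, and a one-step improvement of the policy parameters for valid inference. The sketch below is only a simplified illustration of the first and third ingredients, not the authors' implementation: it assumes a single binary decision point, a logistic stationary decision rule, and pre-estimated nuisance functions (`prop`, `m1`, `m0`), and it replaces the paper's efficient-score-based one-step update with a generic Newton-type correction computed by numerical differentiation; the multi-stage construction and the variable-selection penalty are omitted.

```python
# Simplified sketch, not the authors' implementation: an AIPW value estimate
# for a logistic decision rule at a single binary decision point, plus a
# generic Newton-type "one-step" correction computed by numerical
# differentiation.  The paper's estimator additionally shares the rule across
# stages, adds a penalty for variable selection, and uses an efficient-score
# one-step update; the nuisance functions (prop, m1, m0) are assumed
# pre-estimated here.
import numpy as np


def pi_theta(theta, X):
    """Stationary logistic decision rule: P(A = 1 | X)."""
    return 1.0 / (1.0 + np.exp(-X @ theta))


def aipw_value(theta, X, A, Y, prop, m1, m0):
    """Augmented inverse probability weighted estimate of the policy value.

    prop   : estimated behavior propensity P(A = 1 | X)
    m1, m0 : estimated outcome regressions E[Y | X, A = 1] and E[Y | X, A = 0]
    """
    p = pi_theta(theta, X)
    direct = p * m1 + (1.0 - p) * m0                     # outcome-regression part
    w = np.where(A == 1, p / prop, (1.0 - p) / (1.0 - prop))
    resid = Y - np.where(A == 1, m1, m0)                 # augmentation part
    return np.mean(direct + w * resid)


def num_grad(f, theta, eps=1e-5):
    """Central-difference gradient of a scalar function f at theta."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = eps
        g[j] = (f(theta + e) - f(theta - e)) / (2.0 * eps)
    return g


def one_step(theta_hat, value_fn, eps=1e-4):
    """Generic Newton-type one-step improvement of an initial estimator."""
    g = num_grad(value_fn, theta_hat)
    H = np.zeros((theta_hat.size, theta_hat.size))
    for j in range(theta_hat.size):
        e = np.zeros_like(theta_hat)
        e[j] = eps
        H[:, j] = (num_grad(value_fn, theta_hat + e)
                   - num_grad(value_fn, theta_hat - e)) / (2.0 * eps)
    return theta_hat - np.linalg.solve(H, g)
```

In a full implementation, `prop`, `m1`, and `m0` would come from fitted nuisance models, and the one-step update would be restricted to the features retained by the penalty, which is what allows valid inference on the selected policy parameters.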
Related papers
- Inference on Optimal Policy Values and Other Irregular Functionals via Smoothing [24.31381939721388]
We show that a softmax smoothing-based estimator can be used to estimate parameters that are specified as a maximum of scores involving nuisance components. Our estimator obtains $\sqrt{n}$ convergence rates, avoids parametric restrictions/unrealistic margin assumptions, and is often statistically efficient.
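To make the smoothing idea concrete (the exact form used in the paper may differ), a hard maximum of scores can be replaced by a softmax-weighted average, which is differentiable in the scores and recovers the maximum as the temperature grows:

```latex
% Illustrative softmax smoothing of a maximum (not necessarily the paper's exact form):
S_\beta(s) = \sum_{a} s_a \, \frac{\exp(\beta s_a)}{\sum_{a'} \exp(\beta s_{a'})},
\qquad S_\beta(s) \to \max_a s_a \quad \text{as } \beta \to \infty.
```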
arXiv Detail & Related papers (2025-07-15T22:38:39Z) - Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used algorithm for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
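For reference, the estimator analyzed in this line of work is the TD(0) iterate with linear features $\phi$ together with its Polyak-Ruppert average (standard form; the cited paper's exact assumptions and step-size schedule are not reproduced here):

```latex
% TD(0) with linear function approximation and Polyak-Ruppert averaging (standard form):
\theta_{t+1} = \theta_t + \eta_t \bigl( r_t + \gamma\, \phi(s_{t+1})^\top \theta_t - \phi(s_t)^\top \theta_t \bigr)\, \phi(s_t),
\qquad \bar{\theta}_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t.
```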
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
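The variance-reduction lever here is the behavior policy $b$ in the importance-sampled policy gradient, which in its standard per-trajectory (REINFORCE-style) form reads as follows; the cited paper's estimator may differ in its exact weighting:

```latex
% Standard per-trajectory importance-sampled policy gradient (shown only to make
% the role of the behavior policy b explicit):
\widehat{\nabla}_\theta J(\theta) = \frac{1}{n} \sum_{i=1}^{n}
  \prod_{t} \frac{\pi_\theta(a^i_t \mid s^i_t)}{b(a^i_t \mid s^i_t)}
  \Bigl( \sum_{t} \nabla_\theta \log \pi_\theta(a^i_t \mid s^i_t) \Bigr) R(\tau_i).
```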
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - Nonparametric estimation of a covariate-adjusted counterfactual treatment regimen response curve [2.7446241148152253]
Flexible estimation of the mean outcome under a treatment regimen is a key step toward personalized medicine.
We propose an inverse probability weighted nonparametrically efficient estimator of the smoothed regimen-response curve function.
Some finite-sample properties are explored with simulations.
arXiv Detail & Related papers (2023-09-28T01:46:24Z) - High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z) - Quantile Off-Policy Evaluation via Deep Conditional Generative Learning [21.448553360543478]
Off-policy evaluation (OPE) is concerned with evaluating a new target policy using offline data generated by a potentially different behavior policy.
We propose a doubly-robust inference procedure for quantile OPE in sequential decision making.
We demonstrate the advantages of this proposed estimator through both simulations and a real-world dataset from a short-video platform.
arXiv Detail & Related papers (2022-12-29T22:01:43Z) - Data-Driven Influence Functions for Optimization-Based Causal Inference [105.5385525290466]
We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite differencing.
We study the case where probability distributions are not known a priori but need to be estimated from data.
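Concretely, the object being approximated is the Gateaux derivative of a functional $T$ at $F$ in the direction $G - F$; its finite-difference approximation with step size $\epsilon$ (with estimated distributions plugged in when $F$ and $G$ are unknown) is:

```latex
% Gateaux derivative and its finite-difference approximation (definition-level sketch):
T'(F;\, G - F) = \lim_{\epsilon \downarrow 0} \frac{T\bigl(F + \epsilon(G - F)\bigr) - T(F)}{\epsilon}
\;\approx\; \frac{T\bigl(\widehat{F} + \epsilon(\widehat{G} - \widehat{F})\bigr) - T(\widehat{F})}{\epsilon}.
```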
arXiv Detail & Related papers (2022-08-29T16:16:22Z) - Off-Policy Evaluation with Policy-Dependent Optimization Response [90.28758112893054]
We develop a new framework for off-policy evaluation with a policy-dependent linear optimization response.
We construct unbiased estimators for the policy-dependent estimand by a perturbation method.
We provide a general algorithm for optimizing causal interventions.
arXiv Detail & Related papers (2022-02-25T20:25:37Z) - A Minimax Learning Approach to Off-Policy Evaluation in Partially Observable Markov Decision Processes [31.215206208622728]
We consider off-policy evaluation (OPE) in Partially Observable Markov Decision Processes (POMDPs).
Existing methods suffer from either a large bias in the presence of unmeasured confounders, or a large variance in settings with continuous or large observation/state spaces.
We first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
arXiv Detail & Related papers (2021-11-12T15:52:24Z) - The Role of Lookahead and Approximate Policy Evaluation in Policy Iteration with Linear Value Function Approximation [14.528756508275622]
We show that when linear function approximation is used to represent the value function, a certain minimum amount of lookahead and multi-step return is needed.
When this condition is met, we characterize the finite-time performance of policies obtained using such approximate policy iteration.
arXiv Detail & Related papers (2021-09-28T01:20:08Z) - Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
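Schematically (the precise weights and regression target in VA-OPE are not reproduced here), a variance-weighted fitted Q-iteration step with linear features $\phi$ reweights the squared Bellman residual by an inverse-variance estimate:

```latex
% Schematic variance-weighted fitted Q-iteration step (illustrative only):
\theta_{k+1} = \arg\min_{\theta} \sum_{i} w_i
  \Bigl( r_i + \gamma\, \phi\bigl(s'_i, \pi(s'_i)\bigr)^\top \theta_k - \phi(s_i, a_i)^\top \theta \Bigr)^2,
\qquad w_i \propto \widehat{\sigma}_i^{-2}.
```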
arXiv Detail & Related papers (2021-06-22T17:58:46Z) - Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies [80.42316902296832]
We study the estimation of policy value and gradient of a deterministic policy from off-policy data when actions are continuous.
In this setting, standard importance sampling and doubly robust estimators for policy value and gradient fail because the density ratio does not exist.
We propose several new doubly robust estimators based on different kernelization approaches.
arXiv Detail & Related papers (2020-06-06T15:52:05Z) - Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that concerns both parameter and noisy return uncertainties.
arXiv Detail & Related papers (2020-02-17T13:30:43Z) - Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings [0.0]
We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
arXiv Detail & Related papers (2020-01-13T19:42:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.