Adaptive Experimental Design for Policy Learning
- URL: http://arxiv.org/abs/2401.03756v4
- Date: Thu, 19 Jun 2025 14:27:47 GMT
- Title: Adaptive Experimental Design for Policy Learning
- Authors: Masahiro Kato, Kyohei Okumura, Takuya Ishihara, Toru Kitagawa,
- Abstract summary: We consider a decision-maker who assigns treatment arms to experimental units during an experiment and recommends the estimated best treatment arm based on the contexts at the end of the experiment.<n>We focus on the worst-case expected regret, a measure between the expected outcomes of an optimal policy and our proposed policy.<n>We prove that this strategy is minimax rate-optimal in the sense that its leading factor in the regret upper bound matches the lower bound as the number of experimental units increases.
- Score: 8.73717644648873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigates the contextual best arm identification (BAI) problem, aiming to design an adaptive experiment to identify the best treatment arm conditioned on contextual information (covariates). We consider a decision-maker who assigns treatment arms to experimental units during an experiment and recommends the estimated best treatment arm based on the contexts at the end of the experiment. The decision-maker uses a policy for recommendations, which is a function that provides the estimated best treatment arm given the contexts. In our evaluation, we focus on the worst-case expected regret, a relative measure between the expected outcomes of an optimal policy and our proposed policy. We derive a lower bound for the expected simple regret and then propose a strategy called Adaptive Sampling-Policy Learning (PLAS). We prove that this strategy is minimax rate-optimal in the sense that its leading factor in the regret upper bound matches the lower bound as the number of experimental units increases.
Related papers
- Adaptive Experiments Under High-Dimensional and Data Sparse Settings: Applications for Educational Platforms [10.565276803897325]
Traditional adaptive policies, such as Thompson Sampling, struggle with scalability in high-dimensional and sparse settings.
We propose a framework for determining the feasible number of treatments given a sample size.
We present comparative evaluations of WAPTS across various sample sizes and treatment conditions.
arXiv Detail & Related papers (2025-01-07T18:55:02Z) - Optimal Adaptive Experimental Design for Estimating Treatment Effect [14.088972921434761]
This paper addresses the fundamental question of determining the optimal accuracy in estimating the treatment effect.
By incorporating the concept of doubly robust method into sequential experimental design, we frame the optimal estimation problem as an online bandit learning problem.
Using tools and ideas from both bandit algorithm design and adaptive statistical estimation, we propose a general low switching adaptive experiment framework.
arXiv Detail & Related papers (2024-10-07T23:22:51Z) - Are causal effect estimations enough for optimal recommendations under multitreatment scenarios? [2.4578723416255754]
It is essential to include a causal effect estimation analysis to compare potential outcomes under different treatments or controls.
We propose a comprehensive methodology for multitreatment selection.
arXiv Detail & Related papers (2024-10-07T16:37:35Z) - Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, convergence (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z) - POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy
Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition.
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
arXiv Detail & Related papers (2024-02-09T03:01:13Z) - Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z) - Adaptive Instrument Design for Indirect Experiments [48.815194906471405]
Unlike RCTs, indirect experiments estimate treatment effects by leveragingconditional instrumental variables.
In this paper we take the initial steps towards enhancing sample efficiency for indirect experiments by adaptively designing a data collection policy.
Our main contribution is a practical computational procedure that utilizes influence functions to search for an optimal data collection policy.
arXiv Detail & Related papers (2023-12-05T02:38:04Z) - Choosing a Proxy Metric from Past Experiments [54.338884612982405]
In many randomized experiments, the treatment effect of the long-term metric is often difficult or infeasible to measure.
A common alternative is to measure several short-term proxy metrics in the hope they closely track the long-term metric.
We introduce a new statistical framework to both define and construct an optimal proxy metric for use in a homogeneous population of randomized experiments.
arXiv Detail & Related papers (2023-09-14T17:43:02Z) - Pure Exploration under Mediators' Feedback [63.56002444692792]
Multi-armed bandits are a sequential-decision-making framework, where, at each interaction step, the learner selects an arm and observes a reward.
We consider the scenario in which the learner has access to a set of mediators, each of which selects the arms on the agent's behalf according to a and possibly unknown policy.
We propose a sequential decision-making strategy for discovering the best arm under the assumption that the mediators' policies are known to the learner.
arXiv Detail & Related papers (2023-08-29T18:18:21Z) - Asymptotically Optimal Fixed-Budget Best Arm Identification with
Variance-Dependent Bounds [10.915684166086026]
We investigate the problem of fixed-budget best arm identification (BAI) for minimizing expected simple regret.
We evaluate the decision based on the expected simple regret, which is the difference between the expected outcomes of the best arm and the recommended arm.
We propose the Two-Stage (TS)-Hirano-Imbens-Ridder (HIR) strategy, which utilizes the HIR estimator (Hirano et al., 2003) in recommending the best arm.
arXiv Detail & Related papers (2023-02-06T18:27:11Z) - Improved Policy Evaluation for Randomized Trials of Algorithmic Resource
Allocation [54.72195809248172]
We present a new estimator leveraging our proposed novel concept, that involves retrospective reshuffling of participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimize lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Contextual Bandits in a Survey Experiment on Charitable Giving:
Within-Experiment Outcomes versus Policy Learning [21.9468085255912]
We design and implement an adaptive experiment (a contextual bandit'') to learn a targeted treatment assignment policy.
The goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation.
We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study.
arXiv Detail & Related papers (2022-11-22T04:44:17Z) - Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
arXiv Detail & Related papers (2021-10-17T19:28:45Z) - Minimax Off-Policy Evaluation for Multi-Armed Bandits [58.7013651350436]
We study the problem of off-policy evaluation in the multi-armed bandit model with bounded rewards.
We develop minimax rate-optimal procedures under three settings.
arXiv Detail & Related papers (2021-01-19T18:55:29Z) - Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
arXiv Detail & Related papers (2020-12-30T03:22:35Z) - Treatment recommendation with distributional targets [0.0]
We study the problem of a decision maker who must provide the best possible treatment recommendation based on an experiment.
The desirability of the outcome distribution resulting from the treatment recommendation is measured through a functional distributional characteristic.
We propose two (near)-optimal regret policies.
arXiv Detail & Related papers (2020-05-19T19:27:21Z) - Progressive Multi-Stage Learning for Discriminative Tracking [25.94944743206374]
We propose a joint discriminative learning scheme with the progressive multi-stage optimization policy of sample selection for robust visual tracking.
The proposed scheme presents a novel time-weighted and detection-guided self-paced learning strategy for easy-to-hard sample selection.
Experiments on the benchmark datasets demonstrate the effectiveness of the proposed learning framework.
arXiv Detail & Related papers (2020-04-01T07:01:30Z) - Generalization Bounds and Representation Learning for Estimation of
Potential Outcomes and Causal Effects [61.03579766573421]
We study estimation of individual-level causal effects, such as a single patient's response to alternative medication.
We devise representation learning algorithms that minimize our bound, by regularizing the representation's induced treatment group distance.
We extend these algorithms to simultaneously learn a weighted representation to further reduce treatment group distances.
arXiv Detail & Related papers (2020-01-21T10:16:33Z) - Reward-Conditioned Policies [100.64167842905069]
imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.