Adaptive Experimental Design for Policy Learning
- URL: http://arxiv.org/abs/2401.03756v3
- Date: Thu, 8 Feb 2024 17:41:43 GMT
- Title: Adaptive Experimental Design for Policy Learning
- Authors: Masahiro Kato and Kyohei Okumura and Takuya Ishihara and Toru Kitagawa
- Abstract summary: We study an optimal adaptive experimental design for policy learning with multiple treatment arms.
In the sampling stage, the planner assigns treatment arms adaptively over sequentially arriving experimental units.
After the experiment, the planner recommends an individualized assignment rule to the population.
- Score: 9.54473759331265
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evidence-based targeting has been a topic of growing interest among the
practitioners of policy and business. Formulating decision-maker's policy
learning as a fixed-budget best arm identification (BAI) problem with
contextual information, we study an optimal adaptive experimental design for
policy learning with multiple treatment arms. In the sampling stage, the
planner assigns treatment arms adaptively over sequentially arriving
experimental units upon observing their contextual information (covariates).
After the experiment, the planner recommends an individualized assignment rule
to the population. Setting the worst-case expected regret as the performance
criterion of adaptive sampling and recommended policies, we derive its
asymptotic lower bounds and propose a strategy, the Adaptive Sampling-Policy
Learning strategy (PLAS), whose regret upper bound has a leading factor that matches
the lower bound as the number of experimental units increases.
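The sampling-then-recommendation workflow in the abstract can be sketched in a few lines. The allocation rule below (drawing arms in proportion to their estimated outcome standard deviation within each covariate group) and the plug-in recommendation are illustrative placeholders, not the PLAS strategy itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, n_contexts, budget = 3, 2, 600

# Unknown true mean outcomes, indexed by (context, arm).
true_means = np.array([[0.2, 0.5, 0.4],
                       [0.6, 0.3, 0.5]])

counts = np.ones((n_contexts, n_arms))          # add-one smoothing
sums = np.zeros((n_contexts, n_arms))
sq_sums = np.zeros((n_contexts, n_arms))

# Sampling stage: assign arms adaptively to sequentially arriving units.
for t in range(budget):
    x = rng.integers(n_contexts)                # covariate of the arriving unit
    means = sums[x] / counts[x]
    variances = np.maximum(sq_sums[x] / counts[x] - means ** 2, 1e-3)
    probs = np.sqrt(variances) / np.sqrt(variances).sum()   # illustrative allocation
    a = rng.choice(n_arms, p=probs)
    y = true_means[x, a] + rng.normal(scale=0.5)             # observed outcome
    counts[x, a] += 1
    sums[x, a] += y
    sq_sums[x, a] += y ** 2

# After the experiment: recommend an individualized assignment rule
# (plug-in argmax of the estimated conditional means).
policy = np.argmax(sums / counts, axis=1)
print("recommended arm per context:", policy)
```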
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
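As a rough illustration of learning a stochastic policy only to deploy its deterministic version, the sketch below runs REINFORCE with a Gaussian policy on a toy one-dimensional problem; the exploration level `sigma` is the quantity whose tuning the paper studies, though here it is simply fixed:

```python
import numpy as np

rng = np.random.default_rng(1)

def reward(action):
    # Toy continuous-action objective with its optimum at action = 2.0.
    return -(action - 2.0) ** 2

mu, sigma, lr = 0.0, 0.5, 0.05   # sigma is the exploration level (a hyperparameter)
baseline = 0.0

for step in range(2000):
    a = rng.normal(mu, sigma)                # stochastic policy used during learning
    r = reward(a)
    baseline += 0.01 * (r - baseline)        # running-average baseline (variance reduction)
    grad_log_pi = (a - mu) / sigma ** 2      # score of the Gaussian policy w.r.t. mu
    mu += lr * (r - baseline) * grad_log_pi  # REINFORCE update

print("deployed deterministic action (policy mean):", round(mu, 3))
```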
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition [40.851324484481275]
We study off-policy learning of contextual bandit policies in large discrete action spaces.
We propose a novel two-stage algorithm called Policy Optimization via Two-Stage Policy Decomposition (POTEC).
We show that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.
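A schematic of the two-stage decomposition idea, with randomly initialized stand-ins for the learned components (the actual POTEC training objectives are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
n_actions, n_clusters, dim = 1000, 10, 5
cluster_of = rng.integers(n_clusters, size=n_actions)   # fixed action clustering

# Stand-ins for learned components (random here, just to show the structure):
W_cluster = rng.normal(size=(n_clusters, dim))  # stage 1: scores clusters from context
W_action = rng.normal(size=(n_actions, dim))    # stage 2: scores actions from context

def two_stage_policy(x):
    """First pick a cluster, then pick the best action inside that cluster."""
    c = np.argmax(W_cluster @ x)                            # stage 1: cluster choice
    candidates = np.flatnonzero(cluster_of == c)            # restrict to that cluster
    return candidates[np.argmax(W_action[candidates] @ x)]  # stage 2: within-cluster choice

x = rng.normal(size=dim)
print("chosen action:", two_stage_policy(x))
```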
arXiv Detail & Related papers (2024-02-09T03:01:13Z)
- Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
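The uniform-sampler baseline is easy to picture: collect data with actions assigned uniformly at random, fit a reward model, and deploy the greedy policy. The linear model below is an illustrative choice, not the paper's planner:

```python
import numpy as np

rng = np.random.default_rng(3)
n_actions, dim, n = 4, 3, 2000
theta = rng.normal(size=(n_actions, dim))       # unknown per-action reward weights

# Planning/collection phase: uniform random actions (non-adaptive).
X = rng.normal(size=(n, dim))
A = rng.integers(n_actions, size=n)
Y = np.einsum("nd,nd->n", theta[A], X) + rng.normal(scale=0.1, size=n)

# Fit one least-squares reward model per action from the uniform data.
theta_hat = np.zeros_like(theta)
for a in range(n_actions):
    mask = A == a
    theta_hat[a], *_ = np.linalg.lstsq(X[mask], Y[mask], rcond=None)

greedy = lambda x: int(np.argmax(theta_hat @ x))   # deployed greedy policy
print("greedy action for a new context:", greedy(rng.normal(size=dim)))
```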
arXiv Detail & Related papers (2024-01-10T14:40:23Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator that leverages a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
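The pessimism principle amounts to ranking actions by a lower confidence bound rather than a point estimate, so poorly covered actions are penalized. The count-based Hoeffding-style width below is only a stand-in for the generalized empirical Bernstein bound developed in the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n_actions, n = 3, 300

# Offline data with very uneven coverage of the actions.
A = rng.choice(n_actions, size=n, p=[0.90, 0.08, 0.02])
true_means = np.array([0.50, 0.55, 0.40])
Y = rng.binomial(1, true_means[A])

counts = np.array([(A == a).sum() for a in range(n_actions)])
means = np.array([Y[A == a].mean() if (A == a).any() else 0.0
                  for a in range(n_actions)])

width = np.sqrt(np.log(10.0) / (2 * np.maximum(counts, 1)))  # Hoeffding-style width
lcb = means - width                                          # lower confidence bounds

print("point-estimate choice:", int(np.argmax(means)))
print("pessimistic (LCB) choice:", int(np.argmax(lcb)))
```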
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Contextual Bandits in a Survey Experiment on Charitable Giving: Within-Experiment Outcomes versus Policy Learning [21.9468085255912]
We design and implement an adaptive experiment (a "contextual bandit") to learn a targeted treatment assignment policy.
The goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation.
We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study.
arXiv Detail & Related papers (2022-11-22T04:44:17Z)
- Variance Reduction based Experience Replay for Policy Optimization [3.0790370651488983]
Variance Reduction Experience Replay (VRER) is a framework for the selective reuse of relevant samples to improve policy gradient estimation.
VRER forms the foundation of our sample efficient off-policy learning algorithm known as Policy Gradient with VRER.
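A rough sketch of selective sample reuse for policy gradients: stored samples are kept only when their likelihood ratio under the current policy stays close to one, so reuse does not inflate the variance of the gradient estimate. The Gaussian policies and the fixed ratio threshold are illustrative, not VRER's actual selection rule:

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss_pdf(a, mu, sigma):
    return np.exp(-0.5 * ((a - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

sigma = 0.5
mu_now = 1.0                         # current policy mean
old_mus = [0.2, 0.8, 1.1]            # means of earlier behavior policies

reused_actions, reused_weights = [], []
for mu_old in old_mus:
    actions = rng.normal(mu_old, sigma, size=200)      # stored replay samples
    ratios = gauss_pdf(actions, mu_now, sigma) / gauss_pdf(actions, mu_old, sigma)
    keep = (ratios > 0.5) & (ratios < 2.0)             # illustrative relevance filter
    reused_actions.append(actions[keep])
    reused_weights.append(ratios[keep])
    print(f"behavior mean {mu_old}: reused {keep.sum()} of {len(actions)} samples")

# The kept samples and their ratios would enter an importance-weighted
# policy-gradient estimate for the current policy.
actions = np.concatenate(reused_actions)
weights = np.concatenate(reused_weights)
```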
arXiv Detail & Related papers (2021-10-17T19:28:45Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
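A toy illustration of trading reward against the mutual information between a sensitive variable and the action: a policy that copies the sensitive bit leaks log 2 nats, an uninformative policy leaks nothing, and the penalized objective prefers the latter. The tabular MI computation below is not the paper's model-based estimator:

```python
import numpy as np

def mutual_info(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

# Toy setting: sensitive bit s in {0, 1}; two candidate policies over actions {0, 1}.
# "leaky" copies s into the action; "private" ignores s.
p_s = np.array([0.5, 0.5])
policies = {
    "leaky":   np.array([[1.0, 0.0], [0.0, 1.0]]),   # P(a | s), rows indexed by s
    "private": np.array([[0.5, 0.5], [0.5, 0.5]]),
}
reward = {"leaky": 1.0, "private": 0.8}              # leaking happens to pay slightly more
beta = 0.5                                           # privacy regularization weight

for name, pi in policies.items():
    joint = p_s[:, None] * pi                        # P(s, a)
    objective = reward[name] - beta * mutual_info(joint)
    print(f"{name:8s} reward={reward[name]:.2f}  MI={mutual_info(joint):.3f}  "
          f"penalized objective={objective:.3f}")
```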
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Progressive Multi-Stage Learning for Discriminative Tracking [25.94944743206374]
We propose a joint discriminative learning scheme with the progressive multi-stage optimization policy of sample selection for robust visual tracking.
The proposed scheme presents a novel time-weighted and detection-guided self-paced learning strategy for easy-to-hard sample selection.
Experiments on the benchmark datasets demonstrate the effectiveness of the proposed learning framework.
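The easy-to-hard, self-paced selection can be pictured as training in stages with a growing loss threshold for admitting samples, while newer samples receive larger time weights; the thresholds and time weights below are illustrative rather than the paper's schedule:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
losses = rng.exponential(scale=1.0, size=n)        # per-sample difficulty proxy
ages = rng.integers(0, 100, size=n)                # frames since the sample was collected
time_weight = np.exp(-ages / 50.0)                 # newer samples count more

# Self-paced schedule: admit progressively harder samples at each stage.
for stage, threshold in enumerate([0.5, 1.0, 2.0, np.inf], start=1):
    selected = losses <= threshold
    effective = time_weight[selected].sum()
    print(f"stage {stage}: {selected.sum():3d} samples selected, "
          f"effective (time-weighted) count = {effective:.1f}")
```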
arXiv Detail & Related papers (2020-04-01T07:01:30Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
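The recipe behind reward-conditioned policies is to fit a policy on logged data conditioned on the achieved outcome and then, at deployment, condition on a high target outcome. The tabular toy below illustrates the idea (with uniform logging, conditioning on success recovers the best action), but it is not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(8)
n_states, n_actions, n = 4, 3, 5000

# Logged data from a uniformly random behavior policy.
S = rng.integers(n_states, size=n)
A = rng.integers(n_actions, size=n)
true_q = rng.uniform(size=(n_states, n_actions))
R = rng.binomial(1, true_q[S, A])                  # observed (binary) rewards

# Supervised "policy": empirical P(a | s, r), actions conditioned on the outcome.
counts = np.zeros((n_states, 2, n_actions))
np.add.at(counts, (S, R, A), 1)

def reward_conditioned_policy(s, target_r=1):
    probs = counts[s, target_r] / counts[s, target_r].sum()
    return int(np.argmax(probs))                   # act as if a high reward was achieved

for s in range(n_states):
    print(f"state {s}: conditioned-on-success action = {reward_conditioned_policy(s)}, "
          f"true best action = {int(np.argmax(true_q[s]))}")
```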
arXiv Detail & Related papers (2019-12-31T18:07:43Z)