Contextual Bandits in a Survey Experiment on Charitable Giving:
Within-Experiment Outcomes versus Policy Learning
- URL: http://arxiv.org/abs/2211.12004v1
- Date: Tue, 22 Nov 2022 04:44:17 GMT
- Title: Contextual Bandits in a Survey Experiment on Charitable Giving:
Within-Experiment Outcomes versus Policy Learning
- Authors: Susan Athey, Undral Byambadalai, Vitor Hadad, Sanath Kumar
Krishnamurthy, Weiwen Leung, Joseph Jay Williams
- Abstract summary: We design and implement an adaptive experiment (a ``contextual bandit'') to learn a targeted treatment assignment policy.
The goal is to use a participant's survey responses to determine which charity to expose them to in a donation solicitation.
We evaluate alternative experimental designs by collecting pilot data and then conducting a simulation study.
- Score: 21.9468085255912
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We design and implement an adaptive experiment (a ``contextual bandit'') to
learn a targeted treatment assignment policy, where the goal is to use a
participant's survey responses to determine which charity to expose them to in
a donation solicitation. The design balances two competing objectives:
optimizing the outcomes for the subjects in the experiment (``cumulative regret
minimization'') and gathering data that will be most useful for policy
learning, that is, for learning an assignment rule that will maximize welfare
if used after the experiment (``simple regret minimization''). We evaluate
alternative experimental designs by collecting pilot data and then conducting a
simulation study. Next, we implement our selected algorithm. Finally, we
perform a second simulation study anchored to the collected data that evaluates
the benefits of the algorithm we chose. Our first result is that the value of a
learned policy in this setting is higher when data is collected via a uniform
randomization rather than collected adaptively using standard cumulative regret
minimization or policy learning algorithms. We propose a simple heuristic for
adaptive experimentation that improves upon uniform randomization from the
perspective of policy learning at the expense of increasing cumulative regret
relative to alternative bandit algorithms. The heuristic modifies an existing
contextual bandit algorithm by (i) imposing a lower bound on assignment
probabilities that decay slowly so that no arm is discarded too quickly, and
(ii) after adaptively collecting data, restricting policy learning to select
from arms where sufficient data has been gathered.
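The two modifications in the heuristic described above lend themselves to a short illustration. The sketch below (Python; the constants c, alpha, and min_obs are illustrative assumptions, not the values used in the paper) shows (i) a lower bound on assignment probabilities that decays slowly with the participant index t, and (ii) a post-experiment filter that restricts policy learning to arms where sufficient data has been gathered.

    # Minimal sketch of the two modifications described above (illustrative
    # parameter choices; not the authors' exact schedule or thresholds).
    import numpy as np

    def floor_probabilities(raw_probs, t, c=0.2, alpha=0.25):
        """Impose a lower bound c * t**(-alpha) on each arm's assignment
        probability so that no arm is discarded too quickly; the floor
        decays slowly with the number of participants t."""
        k = len(raw_probs)
        floor = min(c * t ** (-alpha), 1.0 / k)  # keep the floor feasible
        p = np.maximum(np.asarray(raw_probs, dtype=float), floor)
        return p / p.sum()  # renormalize to a valid distribution

    def eligible_arms(assignment_counts, min_obs=50):
        """After the adaptive phase, restrict policy learning to arms with
        enough collected observations (min_obs is an illustrative cutoff)."""
        return [a for a, n in enumerate(assignment_counts) if n >= min_obs]

    # Example: raw probabilities from some bandit algorithm at participant t=400
    raw = [0.94, 0.03, 0.02, 0.01]
    print(floor_probabilities(raw, t=400))   # no arm falls below the floor
    print(eligible_arms([220, 90, 60, 30]))  # arm 3 excluded from policy learning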
Related papers
- Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z)
- Adaptive Experimental Design for Policy Learning [9.54473759331265]
We study an optimal adaptive experimental design for policy learning with multiple treatment arms.
In the sampling stage, the planner assigns treatment arms adaptively over sequentially arriving experimental units.
After the experiment, the planner recommends an individualized assignment rule to the population.
arXiv Detail & Related papers (2024-01-08T09:29:07Z)
- SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits [13.02672341061555]
We study the problem of optimal data collection for policy evaluation in linear bandits.
We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting.
We then use this formulation to derive the optimal allocation of samples per action during data collection.
arXiv Detail & Related papers (2023-01-29T04:33:13Z)
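A minimal sketch of the weighted least squares estimate referenced in the entry above, assuming the per-observation noise variances are known; the simulated data are illustrative, and the paper's optimal allocation rule is not reproduced here.

    # Weighted least squares with per-action noise variances: the estimator
    # underlying designs of this kind (illustrative data only).
    import numpy as np

    def weighted_least_squares(X, y, variances):
        """theta_hat = (X' W X)^{-1} X' W y with W = diag(1 / sigma^2),
        so noisier observations receive less weight."""
        W = np.diag(1.0 / np.asarray(variances, dtype=float))
        A = X.T @ W @ X
        return np.linalg.solve(A, X.T @ W @ y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))            # one feature row per pulled action
    sigma2 = rng.uniform(0.5, 2.0, size=100) # heteroscedastic noise levels
    theta = np.array([1.0, -0.5, 0.25])
    y = X @ theta + rng.normal(scale=np.sqrt(sigma2))
    print(weighted_least_squares(X, y, sigma2))  # should be close to theta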
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
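The pessimism idea in the entry above can be sketched with a toy, non-contextual example: actions are ranked by a Hoeffding-style lower confidence bound rather than a point estimate, so poorly explored actions are penalized. The paper's PPL algorithm operates over full policy classes; the code below only illustrates the principle.

    # Simplified illustration of pessimism: rank actions by a lower confidence
    # bound on their estimated value instead of by the point estimate alone.
    import numpy as np

    def lcb_choice(rewards_by_action, delta=0.05):
        """Pick the action with the largest LCB: mean - width, where the
        width shrinks as 1/sqrt(n); rarely explored actions are penalized."""
        best, best_lcb = None, -np.inf
        for a, r in rewards_by_action.items():
            r = np.asarray(r, dtype=float)
            n = len(r)
            width = np.sqrt(np.log(2.0 / delta) / (2.0 * n))  # Hoeffding-style
            lcb = r.mean() - width
            if lcb > best_lcb:
                best, best_lcb = a, lcb
        return best, best_lcb

    logged = {"arm_0": [0.9, 0.8],                  # high mean but 2 samples
              "arm_1": [0.7, 0.6, 0.8, 0.7, 0.65]}  # lower mean, more data
    print(lcb_choice(logged))  # pessimism prefers the well-explored arm here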
- Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance sampling weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class.
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
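A generic sketch of importance-sampling weighted ERM as described in the entry above, assuming the logged assignment probabilities are available; squared loss and a linear hypothesis class are illustrative choices, and the paper's specific weighting scheme and guarantees are not reproduced.

    # Importance-sampling weighted ERM on adaptively collected data: each
    # logged observation is reweighted by the inverse of the probability with
    # which its action was assigned at logging time.
    import numpy as np

    def ipw_erm(X, y, propensities, lam=1e-3):
        """Minimize sum_i w_i * (y_i - x_i'theta)^2 with w_i = 1 / e_i,
        correcting for adaptive (non-uniform) data collection."""
        w = 1.0 / np.asarray(propensities, dtype=float)
        W = np.diag(w)
        d = X.shape[1]
        A = X.T @ W @ X + lam * np.eye(d)   # small ridge term for stability
        return np.linalg.solve(A, X.T @ W @ y)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    e = rng.uniform(0.1, 0.9, size=200)     # logged assignment probabilities
    y = X @ np.array([2.0, -1.0]) + rng.normal(size=200)
    print(ipw_erm(X, y, e))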
- Policy Learning with Adaptively Collected Data [22.839095992238537]
We address the challenge of learning the optimal policy with adaptively collected data.
We propose an algorithm based on generalized augmented inverse propensity weighted estimators.
We demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.
arXiv Detail & Related papers (2021-05-05T22:03:10Z)
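A short sketch of the augmented inverse propensity weighted (AIPW) scores that estimators of this kind build on, assuming a pre-fitted outcome model mu_hat (hypothetical here) and logged propensities; the generalized estimator in the paper handles adaptively varying propensities more carefully than this illustration.

    # Doubly robust (AIPW) score per unit and arm, combining an outcome model
    # with an inverse propensity weighted correction.
    import numpy as np

    def aipw_scores(actions, rewards, propensities, mu_hat):
        """Gamma[i, a] = mu_hat[i, a] + 1{A_i = a} / e_i * (Y_i - mu_hat[i, a])."""
        n, k = mu_hat.shape
        gamma = mu_hat.copy()
        rows = np.arange(n)
        gamma[rows, actions] += (rewards - mu_hat[rows, actions]) / propensities
        return gamma

    def greedy_policy_value(gamma):
        """Value of the plug-in policy that picks the best arm per unit."""
        return gamma.max(axis=1).mean()

    # Toy example with 4 units, 2 arms, and a hypothetical outcome model mu_hat
    mu_hat = np.array([[0.2, 0.5], [0.4, 0.1], [0.3, 0.3], [0.6, 0.2]])
    actions = np.array([1, 0, 1, 0])
    rewards = np.array([0.7, 0.2, 0.1, 0.9])
    e = np.array([0.5, 0.5, 0.4, 0.6])   # prob. of the action actually taken
    gamma = aipw_scores(actions, rewards, e, mu_hat)
    print(greedy_policy_value(gamma))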
- Meta-learning with Stochastic Linear Bandits [120.43000970418939]
We consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector.
We show both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
arXiv Detail & Related papers (2020-05-18T08:41:39Z)
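The regularization described in the entry above amounts to ridge regression that shrinks toward a bias vector b rather than toward zero. The sketch below shows the closed-form estimate under that penalty; the bias vector, data, and lam are illustrative assumptions, and the OFUL-style confidence sets are omitted.

    # Ridge regression penalized by the squared Euclidean distance to a bias
    # vector b: theta_hat = argmin ||y - X theta||^2 + lam * ||theta - b||^2.
    import numpy as np

    def biased_ridge(X, y, b, lam=1.0):
        """Closed form: (X'X + lam I)^{-1} (X'y + lam b); shrinks toward b
        rather than toward zero, which helps when tasks are similar."""
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * b)

    rng = np.random.default_rng(2)
    b = np.array([1.0, 1.0])                   # e.g. a mean of past task parameters
    theta = b + rng.normal(scale=0.1, size=2)  # new task close to the bias
    X = rng.normal(size=(5, 2))                # only a few observations
    y = X @ theta + rng.normal(scale=0.5, size=5)
    print(biased_ridge(X, y, b, lam=5.0))      # pulled toward b with little data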
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.