Policy Learning with Adaptively Collected Data
- URL: http://arxiv.org/abs/2105.02344v1
- Date: Wed, 5 May 2021 22:03:10 GMT
- Title: Policy Learning with Adaptively Collected Data
- Authors: Ruohan Zhan, Zhimei Ren, Susan Athey, Zhengyuan Zhou
- Abstract summary: We address the challenge of learning the optimal policy with adaptively collected data.
We propose an algorithm based on generalized augmented inverse propensity weighted estimators.
We demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.
- Score: 22.839095992238537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning optimal policies from historical data enables the gains from
personalization to be realized in a wide variety of applications. The growing
policy learning literature focuses on a setting where the treatment assignment
policy does not adapt to the data. However, adaptive data collection is
becoming more common in practice, from two primary sources: 1) data collected
from adaptive experiments that are designed to improve inferential efficiency;
2) data collected from production systems that are adaptively evolving an
operational policy to improve performance over time (e.g. contextual bandits).
In this paper, we aim to address the challenge of learning the optimal policy
with adaptively collected data and provide one of the first theoretical
inquiries into this problem. We propose an algorithm based on generalized
augmented inverse propensity weighted estimators and establish its
finite-sample regret bound. We complement this regret upper bound with a lower
bound that characterizes the fundamental difficulty of policy learning with
adaptive data. Finally, we demonstrate our algorithm's effectiveness using both
synthetic data and public benchmark datasets.
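The abstract centers on an estimator built from generalized augmented inverse propensity weighted (AIPW) scores computed on adaptively collected logs. As a rough, illustrative sketch only (not the paper's estimator), the Python snippet below forms plain AIPW-style doubly robust scores from logged contexts, actions, rewards, and the time-varying propensities recorded during adaptive collection; the names `aipw_scores` and `policy_value`, and the fitted outcome model `mu_hat`, are hypothetical placeholders.

```python
import numpy as np

def aipw_scores(X, A, Y, propensities, mu_hat, n_actions):
    """Doubly robust (AIPW-style) scores for every context-action pair.

    X: (T, d) contexts; A: (T,) logged actions; Y: (T,) observed rewards.
    propensities: (T,) probability with which the adaptive logging policy
        chose A[t] in context X[t], recorded at collection time.
    mu_hat: fitted outcome model; mu_hat(X, a) returns (T,) predicted rewards.
    Returns a (T, n_actions) score matrix gamma.
    """
    T = len(Y)
    gamma = np.column_stack([mu_hat(X, a) for a in range(n_actions)]).astype(float)
    residual = Y - gamma[np.arange(T), A]              # Y_t - mu_hat(X_t, A_t)
    gamma[np.arange(T), A] += residual / propensities  # inverse-propensity correction
    return gamma

def policy_value(gamma, pi_actions):
    """Estimated value of a policy given the action it takes in each logged context."""
    return gamma[np.arange(gamma.shape[0]), pi_actions].mean()
```

With these scores in hand, one could search a small policy class for the candidate that maximizes `policy_value`. The paper's generalized AIPW estimator additionally applies adaptive variance-stabilizing weights to handle propensities that shrink as the logging policy concentrates; that refinement is omitted from this sketch.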
Related papers
- Doubly Optimal Policy Evaluation for Reinforcement Learning [16.7091722884524]
Policy evaluation often suffers from large variance and requires massive amounts of data to achieve the desired accuracy.
In this work, we design an optimal combination of data-collecting policy and data-processing baseline.
Theoretically, we prove that our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than the previously best-performing methods.
arXiv Detail & Related papers (2024-10-03T05:47:55Z)
- Experiment Planning with Function Approximation [49.50254688629728]
We study the problem of experiment planning with function approximation in contextual bandit problems.
We propose two experiment planning strategies compatible with function approximation.
We show that a uniform sampler achieves competitive optimality rates in the setting where the number of actions is small.
arXiv Detail & Related papers (2024-01-10T14:40:23Z)
- When to Learn What: Model-Adaptive Data Augmentation Curriculum [32.99634881669643]
We propose Model-Adaptive Data Augmentation (MADAug), which jointly trains an augmentation policy network to teach the model when to learn what.
Unlike previous work, MADAug selects augmentation operators for each input image by a model-adaptive policy varying between training stages, producing a data augmentation curriculum optimized for better generalization.
arXiv Detail & Related papers (2023-09-09T10:35:27Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks from the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates (a toy sketch of the LCB idea appears after this list).
arXiv Detail & Related papers (2022-12-19T22:43:08Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves average performance on heterogeneous datasets by 49% over the next best-performing offline reinforcement learning methods.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- A Regularized Implicit Policy for Offline Reinforcement Learning [54.7427227775581]
Offline reinforcement learning enables learning from a fixed dataset, without further interaction with the environment.
We propose a framework that supports learning a flexible yet well-regularized fully-implicit policy.
Experiments and an ablation study on the D4RL benchmark validate our framework and the effectiveness of our algorithmic designs.
arXiv Detail & Related papers (2022-02-19T20:22:04Z)
- Model Selection in Batch Policy Optimization [88.52887493684078]
We study the problem of model selection in batch policy optimization.
We identify three sources of error that any model selection algorithm should optimally trade-off in order to be competitive.
arXiv Detail & Related papers (2021-12-23T02:31:50Z)
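The Pessimistic Policy Learning entry above selects policies by lower confidence bounds rather than point estimates. As a toy illustration of that general idea (not the paper's generalized empirical Bernstein construction), the sketch below uses a simple normal-approximation confidence width; `select_policy_pessimistic` and `alpha` are hypothetical names, and the scores are assumed to come from something like the AIPW sketch earlier on this page.

```python
import numpy as np

def select_policy_pessimistic(gamma, candidate_actions, alpha=1.0):
    """Return the candidate policy with the largest lower confidence bound on value.

    gamma: (T, n_actions) per-sample score matrix (e.g. AIPW scores).
    candidate_actions: list of (T,) integer arrays, each giving the action a
        candidate policy takes in every logged context.
    alpha: confidence-width multiplier (an illustrative, not principled, choice).
    """
    T = gamma.shape[0]
    best_pi, best_lcb = None, -np.inf
    for pi in candidate_actions:
        vals = gamma[np.arange(T), pi]                              # per-sample values
        lcb = vals.mean() - alpha * vals.std(ddof=1) / np.sqrt(T)   # normal-approx LCB
        if lcb > best_lcb:
            best_pi, best_lcb = pi, lcb
    return best_pi, best_lcb
```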