Robust On-Policy Data Collection for Data-Efficient Policy Evaluation
- URL: http://arxiv.org/abs/2111.14552v1
- Date: Mon, 29 Nov 2021 14:30:26 GMT
- Title: Robust On-Policy Data Collection for Data-Efficient Policy Evaluation
- Authors: Rujie Zhong, Josiah P. Hanna, Lukas Schäfer, Stefano V. Albrecht
- Abstract summary: In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest.
We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset.
We show that simply running the evaluation policy -- on-policy data collection -- is sub-optimal for this setting.
- Score: 7.745028845389033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper considers how to complement offline reinforcement learning (RL)
data with additional data collection for the task of policy evaluation. In
policy evaluation, the task is to estimate the expected return of an evaluation
policy on an environment of interest. Prior work on offline policy evaluation
typically only considers a static dataset. We consider a setting where we can
collect a small amount of additional data to combine with a potentially larger
offline RL dataset. We show that simply running the evaluation policy --
on-policy data collection -- is sub-optimal for this setting. We then introduce
two new data collection strategies for policy evaluation, both of which
consider previously collected data when collecting future data so as to reduce
distribution shift (or sampling error) in the entire dataset collected. Our
empirical results show that compared to on-policy sampling, our strategies
produce data with lower sampling error and generally lead to lower mean-squared
error in policy evaluation for any total dataset size. We also show that these
strategies can start from initial off-policy data, collect additional data, and
then use both the initial and new data to produce low mean-squared error policy
evaluation without using off-policy corrections.
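To make the idea above concrete, below is a minimal sketch, not the paper's actual algorithms, of data collection that accounts for previously collected data: at each state it prefers whichever action is currently most under-sampled relative to the evaluation policy, and it then estimates the return with a plain Monte Carlo mean over the combined old and new data, with no off-policy corrections. The environment interface (a reset()/step(a) pair with a hashable state), the tabular policy pi_e, and the helper names collect_step and evaluate are illustrative assumptions, and episodes are assumed to be single-step for brevity.

```python
# Simplified illustration of sampling-error-aware data collection for policy
# evaluation. Hypothetical interface, not the algorithms from the paper.
import numpy as np
from collections import defaultdict


def collect_step(state, pi_e, counts):
    """Choose the action whose empirical frequency most under-represents pi_e(.|state)."""
    n = counts[state]                        # per-action visit counts for this state
    if n.sum() == 0:
        # No data for this state yet: fall back to ordinary on-policy sampling.
        return int(np.random.choice(len(pi_e[state]), p=pi_e[state]))
    deficit = pi_e[state] - n / n.sum()      # positive entries are under-sampled actions
    return int(np.argmax(deficit))


def evaluate(env, pi_e, num_actions, offline_returns=(), budget=1000):
    """Combine offline returns with newly collected data; estimate with a plain mean."""
    counts = defaultdict(lambda: np.zeros(num_actions))
    returns = list(offline_returns)          # returns observed in the initial offline data
    for _ in range(budget):
        s = env.reset()
        a = collect_step(s, pi_e, counts)
        counts[s][a] += 1
        _, r, *_ = env.step(a)               # single-step episode: the reward is the return
        returns.append(r)
    return float(np.mean(returns))           # no off-policy corrections applied
```

The design choice this sketch illustrates is that the behavior at collection time depends on the data gathered so far, which is what can drive sampling error in the aggregate dataset below that of ordinary on-policy sampling.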
Related papers
- Doubly Optimal Policy Evaluation for Reinforcement Learning [16.7091722884524]
Policy evaluation often suffers from large variance and requires massive data to achieve desired accuracy.
In this work, we design an optimal combination of data-collecting policy and data-processing baseline.
Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods.
arXiv Detail & Related papers (2024-10-03T05:47:55Z)
- Dataset Clustering for Improved Offline Policy Learning [7.873623003095065]
Offline policy learning aims to discover decision-making policies from previously collected datasets without additional online interactions with the environment.
This paper studies a dataset characteristic that we refer to as multi-behavior, indicating that the dataset is collected using multiple policies that exhibit distinct behaviors.
We propose a behavior-aware deep clustering approach that partitions multi-behavior datasets into several uni-behavior subsets.
arXiv Detail & Related papers (2024-02-14T20:01:41Z)
- When is Off-Policy Evaluation (Reward Modeling) Useful in Contextual Bandits? A Data-Centric Perspective [64.73162159837956]
Evaluating the value of a hypothetical target policy with only a logged dataset is important but challenging.
We propose DataCOPE, a data-centric framework for evaluating a target policy given a dataset.
Our empirical analysis of DataCOPE in the logged contextual bandit settings using healthcare datasets confirms its ability to evaluate both machine-learning and human expert policies.
arXiv Detail & Related papers (2023-11-23T17:13:37Z)
- On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling [3.5253513747455303]
We introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms.
Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy.
arXiv Detail & Related papers (2023-11-14T16:37:28Z)
- Off-Policy Evaluation for Large Action Spaces via Policy Convolution [60.6953713877886]
The Policy Convolution family of estimators uses latent structure within actions to strategically convolve the logging and target policies.
Experiments on synthetic and benchmark datasets demonstrate remarkable mean squared error (MSE) improvements when using PC.
arXiv Detail & Related papers (2023-10-24T01:00:01Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Policy Finetuning in Reinforcement Learning via Design of Experiments using Offline Data [17.317841035807696]
We propose an algorithm that can leverage an offline dataset to design a single non-reactive policy for exploration.
We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.
arXiv Detail & Related papers (2023-07-10T05:33:41Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves upon the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Supervised Off-Policy Ranking [145.3039527243585]
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy.
We propose supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance.
Our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies.
arXiv Detail & Related papers (2021-07-03T07:01:23Z)
- Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient [62.24615324523435]
This paper provides a statistical analysis of high-dimensional batch Reinforcement Learning (RL) using sparse linear function approximation.
When there is a large number of candidate features, our result shows that sparsity-aware methods can make batch RL more sample-efficient.
arXiv Detail & Related papers (2020-11-08T16:48:02Z)
- Off-Policy Evaluation of Bandit Algorithm from Dependent Samples under Batch Update Policy [8.807587076209566]
The goal of off-policy evaluation (OPE) is to evaluate a new policy using historical data obtained via a behavior policy.
Because the contextual bandit updates the policy based on past observations, the samples are not independent and identically distributed.
This paper tackles this problem by constructing an estimator from a martingale difference sequence (MDS) for the dependent samples.
arXiv Detail & Related papers (2020-10-23T15:22:57Z)
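As a side note on the last entry above, here is a minimal sketch, not that paper's estimator, of off-policy evaluation from adaptively logged bandit data using per-round importance weights. The point is that when the behavior policy at round t depends only on earlier rounds, the centered weighted terms form a martingale difference sequence even though the samples are not i.i.d. The Record layout and the pi_e callable are assumptions made for illustration.

```python
# Generic adaptive importance-weighting sketch for OPE from dependent bandit logs.
# Assumed log format: (context, action, reward, probability that the behavior
# policy in force at that round assigned to the logged action).
from typing import Callable, Sequence, Tuple
import numpy as np

Record = Tuple[np.ndarray, int, float, float]  # (context, action, reward, behavior_prob)


def adaptive_ipw(log: Sequence[Record],
                 pi_e: Callable[[np.ndarray, int], float]) -> float:
    """Importance-weighted value estimate of pi_e from adaptively logged data.

    Because the behavior policy at round t depends only on rounds < t, each
    centered term w_t * r_t - V(pi_e) is a martingale difference, which is what
    supports concentration/CLT arguments despite the samples not being i.i.d.
    """
    terms = []
    for x, a, r, b in log:
        w = pi_e(x, a) / b                 # per-round importance weight
        terms.append(w * r)
    return float(np.mean(terms))
```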
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences arising from its use.