Policy Finetuning in Reinforcement Learning via Design of Experiments
using Offline Data
- URL: http://arxiv.org/abs/2307.04354v1
- Date: Mon, 10 Jul 2023 05:33:41 GMT
- Title: Policy Finetuning in Reinforcement Learning via Design of Experiments
using Offline Data
- Authors: Ruiqi Zhang, Andrea Zanette
- Abstract summary: We propose an algorithm that can leverage an offline dataset to design a single non-reactive policy for exploration.
We theoretically analyze the algorithm and measure the quality of the final policy as a function of the local coverage of the original dataset and the amount of additional data collected.
- Score: 17.317841035807696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In some applications of reinforcement learning, a dataset of pre-collected
experience is already available, but it is also possible to acquire some
additional online data to help improve the quality of the policy. However, it
may be preferable to gather additional data with a single, non-reactive
exploration policy and avoid the engineering costs associated with switching
policies.
In this paper we propose an algorithm with provable guarantees that can
leverage an offline dataset to design a single non-reactive policy for
exploration. We theoretically analyze the algorithm and measure the quality of
the final policy as a function of the local coverage of the original dataset
and the amount of additional data collected.
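To make the setting concrete, below is a minimal illustrative sketch, not the paper's algorithm, of how offline coverage statistics could be turned into a single non-reactive exploration policy. The tabular representation, the inverse-count bonus, and the softmax temperature are all assumptions made only for illustration.

```python
import numpy as np

def design_exploration_policy(offline_sa_counts, temperature=1.0):
    """Build a single, fixed (non-reactive) exploration policy that puts more
    probability on actions the offline dataset covers poorly.

    offline_sa_counts: array of shape (num_states, num_actions) holding how
    often each state-action pair appears in the offline dataset.
    """
    # Inverse-count bonus: rarely visited actions receive a larger weight.
    bonus = 1.0 / np.sqrt(offline_sa_counts + 1.0)
    # A per-state softmax turns the bonuses into a valid stochastic policy.
    logits = bonus / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical example: 3 states, 2 actions; action 1 is under-covered.
counts = np.array([[50, 2], [10, 10], [100, 0]])
pi_explore = design_exploration_policy(counts)
print(pi_explore)  # rows sum to 1; under-covered actions get more mass
```

Because this policy is computed once from the offline counts and never updated while the additional data is collected, it is non-reactive in the sense used in the abstract.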
Related papers
- OPERA: Automatic Offline Policy Evaluation with Re-weighted Aggregates of Multiple Estimators [13.408838970377035]
Offline policy evaluation (OPE) allows us to estimate the performance of a new sequential decision-making policy.
We propose a new algorithm that adaptively blends a set of OPE estimators given a dataset, rather than relying on an explicit selection via a statistical procedure.
Our work contributes to improving ease of use for a general-purpose, estimator-agnostic, off-policy evaluation framework for offline RL.
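As a rough illustration of blending OPE estimates, the sketch below uses simple inverse-variance weighting; OPERA's actual statistical re-weighting procedure is not reproduced here, and the estimator names and variance inputs are assumptions.

```python
import numpy as np

def blend_ope_estimates(estimates, variances):
    """Combine several OPE estimates (e.g. from IS, WIS, or FQE) into a single
    value using inverse-variance weights; variances could come from a bootstrap.
    """
    estimates = np.asarray(estimates, dtype=float)
    weights = 1.0 / (np.asarray(variances, dtype=float) + 1e-8)
    weights = weights / weights.sum()  # non-negative weights summing to one
    return float(np.dot(weights, estimates)), weights

value, w = blend_ope_estimates([0.82, 0.95, 0.88], [0.05, 0.20, 0.02])
print(value, w)  # lower-variance estimators dominate the blend
```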
arXiv Detail & Related papers (2024-05-27T23:51:20Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies using existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
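One generic realization of non-uniform sampling, shown purely for illustration and not necessarily the paper's scheme, is to resample trajectories in proportion to a softmax of their returns:

```python
import numpy as np

def resampling_probabilities(trajectory_returns, temperature=1.0):
    """Assign each trajectory a sampling probability that increases with its
    return, so rare high-return trajectories are drawn more often than under
    uniform sampling."""
    returns = np.asarray(trajectory_returns, dtype=float)
    z = (returns - returns.max()) / temperature  # stabilized softmax logits
    probs = np.exp(z)
    return probs / probs.sum()

# Usage with an offline buffer of N trajectories:
# idx = np.random.choice(N, size=batch_size, p=resampling_probabilities(returns))
```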
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Iteratively Refined Behavior Regularization for Offline Reinforcement Learning [57.10922880400715]
In this paper, we propose a new algorithm that substantially enhances behavior-regularization based on conservative policy iteration.
By iteratively refining the reference policy used for behavior regularization, the conservative policy update guarantees gradual improvement.
Experimental results on the D4RL benchmark indicate that our method outperforms previous state-of-the-art baselines in most tasks.
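For intuition, a generic behavior-regularized objective of the kind being refined might look like the sketch below; the log-ratio penalty and the coefficient alpha are assumptions, not the paper's exact formulation.

```python
import numpy as np

def behavior_regularized_objective(q_values, log_prob_new, log_prob_ref, alpha=1.0):
    """Objective (to be maximized) trading off estimated value against staying
    close to a reference policy via a log-ratio (KL-style) penalty.

    q_values     : Q(s, a) for actions proposed by the current policy
    log_prob_new : log pi(a|s) under the current policy
    log_prob_ref : log pi_ref(a|s) under the reference policy
    """
    penalty = np.asarray(log_prob_new) - np.asarray(log_prob_ref)
    return float(np.mean(np.asarray(q_values) - alpha * penalty))

# The iterative refinement highlighted above amounts to periodically replacing
# pi_ref with a copy of the current, improved policy and repeating the update.
```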
arXiv Detail & Related papers (2023-06-09T07:46:24Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
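One simple design choice studied in this line of work is how to mix offline and online transitions in each training batch; the sketch below shows a fixed-fraction mixture, with the 50/50 split chosen only as an illustrative default.

```python
import numpy as np

def sample_mixed_batch(offline_buffer, online_buffer, batch_size, offline_fraction=0.5):
    """Fill a training batch with a fixed fraction of offline transitions and
    the remainder from the online replay buffer (both buffers are lists of
    transition tuples)."""
    n_offline = int(batch_size * offline_fraction)
    n_online = batch_size - n_offline
    offline_idx = np.random.randint(0, len(offline_buffer), size=n_offline)
    online_idx = np.random.randint(0, len(online_buffer), size=n_online)
    return [offline_buffer[i] for i in offline_idx] + [online_buffer[i] for i in online_idx]
```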
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Latent-Variable Advantage-Weighted Policy Optimization for Offline RL [70.01851346635637]
Offline reinforcement learning methods hold the promise of learning policies from pre-collected datasets without the need to query the environment for new transitions.
In practice, offline datasets are often heterogeneous, i.e., collected in a variety of scenarios.
We propose to leverage latent-variable policies that can represent a broader class of policy distributions.
Our method improves the average performance of the next best-performing offline reinforcement learning methods by 49% on heterogeneous datasets.
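For context, the advantage-weighted update that this family of methods builds on can be sketched as follows; the latent-variable policy itself is not reproduced here, and the temperature and weight clipping are illustrative assumptions.

```python
import torch

def advantage_weighted_policy_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Loss (to be minimized) for an advantage-weighted policy update: dataset
    actions are re-weighted by exp(A / beta), so high-advantage behavior in the
    offline data is imitated more strongly.

    log_probs : log pi(a|s) for dataset actions under the current policy
    advantages: advantage estimates A(s, a) for those actions
    """
    # Clip the exponentiated advantages and detach them so gradients flow only
    # through the policy's log-probabilities.
    weights = torch.clamp(torch.exp(advantages / beta), max=max_weight).detach()
    return -(weights * log_probs).mean()
```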
arXiv Detail & Related papers (2022-03-16T21:17:03Z)
- Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons [16.635744815056906]
We consider reinforcement learning methods in offline domains without additional online data collection, such as mobile health applications.
The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator.
arXiv Detail & Related papers (2022-02-26T15:29:46Z)
- Robust On-Policy Data Collection for Data-Efficient Policy Evaluation [7.745028845389033]
In policy evaluation, the task is to estimate the expected return of an evaluation policy on an environment of interest.
We consider a setting where we can collect a small amount of additional data to combine with a potentially larger offline RL dataset.
We show that simply running the evaluation policy -- on-policy data collection -- is sub-optimal for this setting.
arXiv Detail & Related papers (2021-11-29T14:30:26Z)
- Policy Learning with Adaptively Collected Data [22.839095992238537]
We address the challenge of learning the optimal policy with adaptively collected data.
We propose an algorithm based on generalized augmented inverse propensity weighted estimators.
We demonstrate our algorithm's effectiveness using both synthetic data and public benchmark datasets.
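As background, a textbook augmented inverse propensity weighted (doubly robust) estimate of a policy's value from logged data looks like the sketch below; the paper's generalized estimator for adaptively collected data is not reproduced here.

```python
import numpy as np

def aipw_policy_value(actions, rewards, propensities, q_hat, pi_eval):
    """Doubly robust / AIPW estimate of a target policy's value from logged data.

    actions[i]      : action taken at round i (integer index)
    rewards[i]      : observed reward at round i
    propensities[i] : probability the logging policy gave to actions[i]
    q_hat[i, a]     : fitted outcome model for the reward of action a in context i
    pi_eval[i, a]   : probability the evaluation policy gives to action a in context i
    """
    q_hat = np.asarray(q_hat, dtype=float)
    pi_eval = np.asarray(pi_eval, dtype=float)
    actions = np.asarray(actions)
    idx = np.arange(len(actions))
    # Direct-method term: expected reward under the fitted outcome model.
    direct = (pi_eval * q_hat).sum(axis=1)
    # Importance-weighted correction for the action that was actually taken.
    correction = pi_eval[idx, actions] / np.asarray(propensities, dtype=float) * (
        np.asarray(rewards, dtype=float) - q_hat[idx, actions])
    return float(np.mean(direct + correction))
```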
arXiv Detail & Related papers (2021-05-05T22:03:10Z)
- Benchmarks for Deep Off-Policy Evaluation [152.28569758144022]
We present a collection of policies that can be used for benchmarking off-policy evaluation.
The goal of our benchmark is to provide a standardized measure of progress that is motivated from a set of principles.
We provide open-source access to our data and code to foster future research in this area.
arXiv Detail & Related papers (2021-03-30T18:09:33Z)
- MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost-ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
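As a loose illustration of model-based uncertainty regularization, one common recipe penalizes model-predicted rewards by ensemble disagreement; this is a generic sketch, not MUSBO's specific formulation.

```python
import numpy as np

def uncertainty_penalized_rewards(rewards, ensemble_next_preds, penalty_coef=1.0):
    """Penalize model-predicted rewards by the disagreement of an ensemble of
    learned dynamics models, discouraging exploitation of regions where the
    model is unreliable.

    rewards             : shape (batch,) model-predicted rewards
    ensemble_next_preds : shape (num_models, batch, state_dim) next-state predictions
    """
    ensemble_next_preds = np.asarray(ensemble_next_preds, dtype=float)
    # Disagreement = per-transition standard deviation across ensemble members,
    # averaged over state dimensions.
    disagreement = ensemble_next_preds.std(axis=0).mean(axis=-1)
    return np.asarray(rewards, dtype=float) - penalty_coef * disagreement
```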
arXiv Detail & Related papers (2021-02-23T01:30:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.