Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid
Reinforcement Learning
- URL: http://arxiv.org/abs/2305.10282v1
- Date: Wed, 17 May 2023 15:17:23 GMT
- Title: Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid
Reinforcement Learning
- Authors: Gen Li, Wenhao Zhan, Jason D. Lee, Yuejie Chi, Yuxin Chen
- Abstract summary: A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
- Score: 66.43003402281659
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper studies tabular reinforcement learning (RL) in the hybrid setting,
which assumes access to both an offline dataset and online interactions with
the unknown environment. A central question boils down to how to efficiently
utilize online data collection to strengthen and complement the offline dataset
and enable effective policy fine-tuning. Leveraging recent advances in
reward-agnostic exploration and model-based offline RL, we design a three-stage
hybrid RL algorithm that beats the best of both worlds -- pure offline RL and
pure online RL -- in terms of sample complexities. The proposed algorithm does
not require any reward information during data collection. Our theory is
developed based on a new notion called single-policy partial concentrability,
which captures the trade-off between distribution mismatch and miscoverage and
guides the interplay between offline and online data.
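The abstract describes the three-stage design only at a high level (reward-agnostic online exploration, combination with the offline dataset, and model-based offline RL on the merged data). As a rough illustration, the sketch below lays out a hypothetical tabular pipeline with that structure; the gym-style environment interface and all function and variable names are assumptions for exposition, not the authors' algorithm, and the pessimism/bonus terms that the theory relies on are omitted.
```python
import numpy as np

def hybrid_rl_pipeline(offline_data, env, n_online_episodes, S, A, H, gamma=0.99):
    """Hypothetical three-stage hybrid RL sketch (not the paper's exact algorithm).

    Stage 1: reward-agnostic online exploration -- collect transitions without
             using rewards to guide exploration.
    Stage 2: merge offline and online transitions into one dataset.
    Stage 3: model-based offline RL -- fit an empirical model from the merged
             data and plan against it (pessimism/bonus terms omitted for brevity).
    Assumes a classic gym-style env with integer states/actions and
    env.step(a) -> (s_next, reward, done, info).
    """
    # Stage 1: reward-agnostic exploration (here: a uniform-random stand-in).
    online_data = []
    for _ in range(n_online_episodes):
        s = env.reset()
        for _ in range(H):
            a = np.random.randint(A)          # reward-free: action choice ignores rewards
            s_next, r, done, _ = env.step(a)  # reward is stored but never drives exploration
            online_data.append((s, a, r, s_next))
            s = s_next
            if done:
                break

    # Stage 2: combine the two datasets.
    data = list(offline_data) + online_data

    # Stage 3: fit an empirical transition/reward model and run value iteration.
    counts = np.zeros((S, A, S))
    reward_sum = np.zeros((S, A))
    for (s, a, r, s_next) in data:
        counts[s, a, s_next] += 1
        reward_sum[s, a] += r
    n_sa = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(n_sa > 0, counts / np.maximum(n_sa, 1), 1.0 / S)
    R_hat = reward_sum / np.maximum(n_sa[:, :, 0], 1)

    V = np.zeros(S)
    for _ in range(1000):                     # value iteration on the learned model
        Q = R_hat + gamma * P_hat @ V
        V = Q.max(axis=1)
    return Q.argmax(axis=1)                   # greedy policy w.r.t. the learned model
```
The point the sketch tries to convey is the reward-agnostic data-collection claim: rewards observed online are recorded for the later model-fitting stage but never influence how the exploration policy behaves.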
Related papers
- Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm.
Our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z)
- Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees [23.838354396418868]
We propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data.
Our approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework.
arXiv Detail & Related papers (2023-11-14T18:45:56Z)
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies from historical (offline) data.
One of the main challenges in offline RL is distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Semi-Offline Reinforcement Learning for Optimized Text Generation [35.1606951874979]
In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline.
Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability.
We propose semi-offline RL, a novel paradigm that smoothly transitions from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings.
arXiv Detail & Related papers (2023-06-16T09:24:29Z)
- Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting in which the agent is first trained on the offline dataset and then further trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z)
- Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient [42.47810044648846]
We consider a hybrid reinforcement learning setting (Hybrid RL) in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction.
We adapt the classical Q-learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q (an illustrative hybrid-update sketch follows this entry).
We show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks.
arXiv Detail & Related papers (2022-10-13T04:19:05Z)
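Hy-Q is summarized above in only a few sentences; the following is a hypothetical sketch of how a Q-learning-style update might draw on both an offline buffer and an online buffer in a tabular setting. The half-and-half mixing rule, hyperparameters, and names are illustrative assumptions, not the published Hy-Q procedure.
```python
import random
import numpy as np

def hybrid_q_update(Q, offline_buffer, online_buffer, alpha=0.1, gamma=0.99,
                    batch_size=64):
    """One hypothetical hybrid update: sample half the batch from offline data
    and half from freshly collected online data, then apply standard Q-learning
    targets. Q is an (S, A) numpy array; buffers hold (s, a, r, s_next, done)
    tuples. Illustrative only, not the published Hy-Q algorithm."""
    half = batch_size // 2
    batch = random.sample(offline_buffer, min(half, len(offline_buffer))) + \
            random.sample(online_buffer, min(half, len(online_buffer)))
    for (s, a, r, s_next, done) in batch:
        target = r if done else r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```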
- Don't Change the Algorithm, Change the Data: Exploratory Data for Offline Reinforcement Learning [147.61075994259807]
We propose Exploratory data for Offline RL (ExORL), a data-centric approach to offline RL.
ExORL first generates data with unsupervised reward-free exploration, then relabels this data with a downstream reward before training a policy with offline RL (a minimal relabeling sketch follows this entry).
We find that exploratory data allows vanilla off-policy RL algorithms, without any offline-specific modifications, to outperform or match state-of-the-art offline RL algorithms on downstream tasks.
arXiv Detail & Related papers (2022-01-31T18:39:27Z)
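The ExORL recipe quoted above (reward-free exploration, then reward relabeling, then standard offline RL) reduces to a small relabeling step once a downstream reward function is available. A minimal sketch is given below, assuming transitions stored as (s, a, s_next) tuples and a reward function callable as reward_fn(s, a, s_next); these conventions are assumptions, not ExORL's actual code.
```python
def relabel_with_downstream_reward(reward_free_data, reward_fn):
    """Attach task rewards to transitions collected without any reward signal.

    reward_free_data: iterable of (s, a, s_next) transitions from unsupervised
                      exploration.
    reward_fn:        downstream task reward, assumed callable as r(s, a, s_next).
    Returns a list of (s, a, r, s_next) tuples ready for any offline RL learner.
    """
    return [(s, a, reward_fn(s, a, s_next), s_next)
            for (s, a, s_next) in reward_free_data]

# Hypothetical usage:
#   relabeled = relabel_with_downstream_reward(exploration_data, task_reward)
#   policy = some_offline_rl_algorithm(relabeled)
```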
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR); an illustrative advantage-weighted sketch follows this entry.
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
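As a rough illustration of the critic-regularized regression idea referenced above, the sketch below weights a behavior-cloning-style log-likelihood of dataset actions by a clipped exponential of an estimated advantage; the particular weighting, temperature, and array shapes are assumptions chosen for brevity rather than the paper's exact objective.
```python
import numpy as np

def crr_style_loss(log_probs, q_values, values, beta=1.0, max_weight=20.0):
    """Advantage-weighted regression loss in the spirit of CRR (illustrative).

    log_probs: log pi(a_i | s_i) for dataset actions, shape (N,)
    q_values:  critic estimates Q(s_i, a_i), shape (N,)
    values:    state-value estimates V(s_i), shape (N,)
    Dataset actions with high estimated advantage get large weights, so the
    policy is pulled toward them; the clipped exponential is one common choice.
    """
    advantage = q_values - values
    weights = np.minimum(np.exp(advantage / beta), max_weight)
    return -np.mean(weights * log_probs)   # minimize weighted negative log-likelihood
```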
- D4RL: Datasets for Deep Data-Driven Reinforcement Learning [119.49182500071288]
We introduce benchmarks specifically designed for the offline setting, guided by key properties of datasets relevant to real-world applications of offline RL.
By moving beyond simple benchmark tasks and data collected by partially-trained RL agents, we reveal important and unappreciated deficiencies of existing algorithms.
arXiv Detail & Related papers (2020-04-15T17:18:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.