Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online
Reinforcement Learning
- URL: http://arxiv.org/abs/2310.17966v2
- Date: Mon, 30 Oct 2023 05:22:23 GMT
- Title: Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online
Reinforcement Learning
- Authors: Shenzhi Wang, Qisen Yang, Jiawei Gao, Matthieu Gaetan Lin, Hao Chen,
Liwei Wu, Ning Jia, Shiji Song, Gao Huang
- Abstract summary: Family Offline-to-Online RL (FamO2O) is a framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances.
FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark.
- Score: 71.02384943570372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline-to-online reinforcement learning (RL) is a training paradigm that
combines pre-training on a pre-collected dataset with fine-tuning in an online
environment. However, the incorporation of online fine-tuning can intensify the
well-known distributional shift problem. Existing solutions tackle this problem
by imposing a policy constraint on the policy improvement objective in both
offline and online learning. They typically advocate a single balance between
policy improvement and constraints across diverse data collections. This
one-size-fits-all manner may not optimally leverage each collected sample due
to the significant variation in data quality across different states. To this
end, we introduce Family Offline-to-Online RL (FamO2O), a simple yet effective
framework that empowers existing algorithms to determine state-adaptive
improvement-constraint balances. FamO2O utilizes a universal model to train a
family of policies with different improvement/constraint intensities, and a
balance model to select a suitable policy for each state. Theoretically, we
prove that state-adaptive balances are necessary for achieving a higher policy
performance upper bound. Empirically, extensive experiments show that FamO2O
offers a statistically significant improvement over various existing methods,
achieving state-of-the-art performance on the D4RL benchmark. Codes are
available at https://github.com/LeapLabTHU/FamO2O.
Related papers
- Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness [11.241036026084222]
offline-to-online (O2O) RL provides a paradigm for improving an offline trained agent within limited online interactions.
Most offline RL algorithms suffer from performance drops and fail to achieve stable policy improvement in O2O adaptation.
We propose the Robust Offline-to-Online (RO2O) algorithm, designed to enhance offline policies through uncertainty and smoothness.
arXiv Detail & Related papers (2023-09-29T04:42:50Z) - Hundreds Guide Millions: Adaptive Offline Reinforcement Learning with
Expert Guidance [74.31779732754697]
We propose a novel plug-in approach named Guided Offline RL (GORL)
GORL employs a guiding network, along with only a few expert demonstrations, to adaptively determine the relative importance of the policy improvement and policy constraint for every sample.
Experiments on various environments suggest that GORL can be easily installed on most offline RL algorithms with statistically significant performance improvements.
arXiv Detail & Related papers (2023-09-04T08:59:04Z) - Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs
and Practical Solutions [30.050083797177706]
offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment.
Online finetuning of such offline models can further improve performance.
We show that it is possible to use standard online off-policy algorithms for faster improvement.
arXiv Detail & Related papers (2023-03-30T14:08:31Z) - Improving TD3-BC: Relaxed Policy Constraint for Offline Learning and
Stable Online Fine-Tuning [7.462336024223669]
Key challenge is overcoming overestimation bias for actions not present in data.
One simple method to reduce this bias is to introduce a policy constraint via behavioural cloning (BC)
We demonstrate that by continuing to train a policy offline while reducing the influence of the BC component we can produce refined policies.
arXiv Detail & Related papers (2022-11-21T19:10:27Z) - Offline RL With Realistic Datasets: Heteroskedasticity and Support
Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z) - MUSBO: Model-based Uncertainty Regularized and Sample Efficient Batch
Optimization for Deployment Constrained Reinforcement Learning [108.79676336281211]
Continuous deployment of new policies for data collection and online learning is either cost ineffective or impractical.
We propose a new algorithmic learning framework called Model-based Uncertainty regularized and Sample Efficient Batch Optimization.
Our framework discovers novel and high quality samples for each deployment to enable efficient data collection.
arXiv Detail & Related papers (2021-02-23T01:30:55Z) - Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.