Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias
- URL: http://arxiv.org/abs/2310.08558v1
- Date: Thu, 12 Oct 2023 17:50:09 GMT
- Title: Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias
- Authors: Max Sobol Mark, Archit Sharma, Fahim Tajwar, Rafael Rafailov, Sergey Levine, Chelsea Finn
- Abstract summary: Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
- Score: 96.14064037614942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is desirable for policies to optimistically explore new states and
behaviors during online reinforcement learning (RL) or fine-tuning, especially
when prior offline data does not provide enough state coverage. However,
exploration bonuses can bias the learned policy, and our experiments find that
naive, yet standard use of such bonuses can fail to recover a performant
policy. Concurrently, pessimistic training in offline RL has enabled recovery
of performant policies from static datasets. Can we leverage offline RL to
recover better policies from online interaction? We make a simple observation
that a policy can be trained from scratch on all interaction data with
pessimistic objectives, thereby decoupling the policies used for data
collection and for evaluation. Specifically, we propose offline retraining, a
policy extraction step at the end of online fine-tuning in our
Offline-to-Online-to-Offline (OOO) framework for reinforcement learning (RL).
An optimistic (exploration) policy is used to interact with the environment,
and a separate pessimistic (exploitation) policy is trained on all the observed
data for evaluation. Such decoupling can reduce any bias from online
interaction (intrinsic rewards, primacy bias) in the evaluation policy, and can
allow more exploratory behaviors during online interaction which in turn can
generate better data for exploitation. OOO is complementary to several
offline-to-online RL and online RL methods, and improves their average
performance by 14% to 26% in our fine-tuning experiments, achieves
state-of-the-art performance on several environments in the D4RL benchmarks,
and improves online RL performance by 165% on two OpenAI gym environments.
Further, OOO can enable fine-tuning from incomplete offline datasets where
prior methods can fail to recover a performant policy. Implementation:
https://github.com/MaxSobolMark/OOO
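To make the decoupling concrete, below is a minimal Python sketch of the OOO recipe as described in the abstract: an optimistic policy (trained with an exploration bonus) collects data, and a pessimistic policy is then retrained offline on everything observed and used for evaluation. The agent interfaces (act/update/train_offline), the intrinsic bonus, and the Gym-style environment loop are assumptions for illustration only; they do not reflect the official implementation linked above.

```python
# Minimal sketch of the Offline-to-Online-to-Offline (OOO) recipe from the
# abstract. All interfaces here (explorer, exploiter, intrinsic_bonus, env)
# are illustrative placeholders, not the official API.

def ooo_training(env, explorer, exploiter, intrinsic_bonus,
                 offline_dataset, num_online_steps):
    """Decouple data collection (optimistic) from evaluation (pessimistic)."""
    replay = list(offline_dataset)  # start from any prior offline data

    # Online phase: the optimistic exploration policy interacts with the env.
    obs = env.reset()
    for _ in range(num_online_steps):
        action = explorer.act(obs)
        next_obs, reward, done, _ = env.step(action)
        # The exploration bonus shapes only the explorer's update, so any
        # bias it introduces never reaches the evaluation policy.
        explorer.update(obs, action, reward + intrinsic_bonus(obs, action),
                        next_obs, done)
        replay.append((obs, action, reward, next_obs, done))  # store true reward
        obs = env.reset() if done else next_obs

    # Offline retraining: a separate pessimistic (exploitation) policy is
    # trained from scratch on ALL observed data and is the one evaluated.
    exploiter.train_offline(replay)
    return exploiter
```

The point of the decoupling is that intrinsic rewards and primacy bias can only affect the data-collection policy; the evaluated policy sees every transition with its true reward and is trained with a pessimistic offline objective.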
Related papers
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, whereas if it switches directly to optimism, performance may suffer a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z)
- Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL [42.57662196581823]
Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks.
Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer.
We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
arXiv Detail & Related papers (2024-05-28T18:38:46Z)
- Train Once, Get a Family: State-Adaptive Balances for Offline-to-Online Reinforcement Learning [71.02384943570372]
Family Offline-to-Online RL (FamO2O) is a framework that empowers existing algorithms to determine state-adaptive improvement-constraint balances.
FamO2O offers a statistically significant improvement over various existing methods, achieving state-of-the-art performance on the D4RL benchmark.
arXiv Detail & Related papers (2023-10-27T08:30:54Z)
- Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning [9.341618348621662]
We aim to find the best-performing policy within a limited budget of online interactions.
We first study the major online RL exploration methods based on intrinsic rewards and UCB.
We then introduce an algorithm for planning to go out-of-distribution that avoids these issues.
arXiv Detail & Related papers (2023-10-09T13:47:05Z)
- Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions [30.050083797177706]
Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment.
Online finetuning of such offline models can further improve performance.
We show that it is possible to use standard online off-policy algorithms for faster improvement.
arXiv Detail & Related papers (2023-03-30T14:08:31Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning [80.25648265273155]
Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment.
During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data.
We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability (a hypothetical weighting rule is sketched after this list).
Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark.
arXiv Detail & Related papers (2022-10-25T09:08:26Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which uses an experience-picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behaviors on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Near Real-World Benchmarks for Offline Reinforcement Learning [26.642722521820467]
We present a suite of near real-world benchmarks, NewRL.
NewRL contains datasets from various domains with controlled sizes and extra test datasets for the purpose of policy validation.
We argue that the performance of a policy should also be compared with the deterministic version of the behavior policy, instead of the dataset reward.
arXiv Detail & Related papers (2021-02-01T09:19:10Z)
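As a companion to the "Adaptive Behavior Cloning Regularization" entry above, here is a hypothetical sketch of adaptively weighting a behavior-cloning (BC) loss during online fine-tuning. The specific rule, names, and hyperparameters are illustrative assumptions, not the cited paper's exact scheme.

```python
import numpy as np

def adaptive_bc_weight(recent_returns, reference_return,
                       base_weight=1.0, min_weight=0.0, max_weight=1.0):
    """Illustrative adaptive weight for a BC regularizer during online
    fine-tuning: stay close to the offline data while performance is still
    unstable, and relax the constraint as online returns improve.
    Hypothetical rule, not the cited paper's exact scheme."""
    if len(recent_returns) == 0:
        return max_weight  # no online evidence yet: keep full regularization
    # Progress of current performance toward a reference (e.g. offline return).
    progress = float(np.mean(recent_returns)) / max(reference_return, 1e-8)
    # More progress -> weaker BC regularization, clipped to a safe range.
    weight = base_weight * (1.0 - np.clip(progress, 0.0, 1.0))
    return float(np.clip(weight, min_weight, max_weight))
```

In use, the returned weight would multiply the BC term in the actor loss (e.g. `rl_loss + w * bc_loss`), shrinking the constraint as online returns approach the reference performance.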