A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning
- URL: http://arxiv.org/abs/2410.23737v1
- Date: Thu, 31 Oct 2024 08:49:37 GMT
- Title: A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning
- Authors: JaeYoon Kim, Junyu Xuan, Christy Liang, Farookh Hussain
- Abstract summary: Offline-to-online reinforcement learning (RL) uses both pre-trained offline policies and online policies trained for downstream tasks.
In this study, we propose an innovative offline-to-online RL method that employs a non-monolithic exploration approach.
- Score: 2.823645435281551
- License:
- Abstract: Offline-to-online reinforcement learning (RL) leverages both pre-trained offline policies and online policies trained for downstream tasks, aiming to improve data efficiency and accelerate performance enhancement. An existing approach, Policy Expansion (PEX), utilizes a policy set composed of both policies without modifying the offline policy for exploration and learning. However, this approach fails to ensure sufficient learning of the online policy due to an excessive focus on exploration with both policies. Since the pre-trained offline policy can assist the online policy in exploiting a downstream task based on its prior experience, it should be executed effectively and tailored to the specific requirements of the downstream task. In contrast, the online policy, with its immature behavioral strategy, has the potential for exploration during the training phase. Therefore, our research focuses on harmonizing the advantages of the offline policy, termed exploitation, with those of the online policy, referred to as exploration, without modifying the offline policy. In this study, we propose an innovative offline-to-online RL method that employs a non-monolithic exploration approach. Our methodology demonstrates superior performance compared to PEX.
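The abstract's central idea, a frozen offline policy used for exploitation alongside a separately trained online policy used for exploration, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class name `ModeSwitchingAgent`, the parameters `exploit_steps` and `explore_steps`, and above all the fixed-length alternating schedule, which stands in for whatever switching rule the paper actually uses. The sketch only shows the structural point that the agent holds one mode for several consecutive steps and never modifies the offline policy.

```python
from typing import Callable, Sequence, Tuple

# A policy maps a state (a sequence of floats) to a scalar action.
Policy = Callable[[Sequence[float]], float]


class ModeSwitchingAgent:
    """Hypothetical two-policy agent with distinct exploit and explore modes.

    The offline policy is only ever called, never updated. The fixed
    alternating schedule is an illustrative assumption, not the paper's rule.
    """

    def __init__(self, offline_policy: Policy, online_policy: Policy,
                 exploit_steps: int, explore_steps: int) -> None:
        if exploit_steps <= 0 or explore_steps <= 0:
            raise ValueError("both phase lengths must be positive")
        self.offline_policy = offline_policy  # frozen, pre-trained
        self.online_policy = online_policy    # trained on the downstream task
        self.exploit_steps = exploit_steps
        self.explore_steps = explore_steps
        self._t = 0  # number of actions taken so far

    def current_mode(self) -> str:
        """Return 'exploit' or 'explore' for the upcoming step."""
        phase = self._t % (self.exploit_steps + self.explore_steps)
        return "exploit" if phase < self.exploit_steps else "explore"

    def act(self, state: Sequence[float]) -> Tuple[float, str]:
        """Pick the policy for the current mode and return (action, mode)."""
        mode = self.current_mode()
        policy = self.offline_policy if mode == "exploit" else self.online_policy
        self._t += 1
        return policy(state), mode


if __name__ == "__main__":
    # Stand-in policies: the offline one always returns 0.0, the online one 1.0.
    agent = ModeSwitchingAgent(lambda s: 0.0, lambda s: 1.0,
                               exploit_steps=2, explore_steps=1)
    for _ in range(6):
        print(agent.act([0.0]))
```

In a real training loop, only transitions would be used to update the online policy, and the switching decision would normally depend on learned signals rather than a step counter.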
Related papers
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z)
- Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL [42.57662196581823]
Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks.
Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer.
We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
arXiv Detail & Related papers (2024-05-28T18:38:46Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- Finetuning from Offline Reinforcement Learning: Challenges, Trade-offs and Practical Solutions [30.050083797177706]
Offline reinforcement learning (RL) allows for the training of competent agents from offline datasets without any interaction with the environment.
Online finetuning of such offline models can further improve performance.
We show that it is possible to use standard online off-policy algorithms for faster improvement.
arXiv Detail & Related papers (2023-03-30T14:08:31Z)
- Policy Expansion for Bridging Offline-to-Online Reinforcement Learning [20.24902196844508]
In this work, we introduce a policy expansion scheme for this task.
After learning the offline policy, we use it as one candidate policy in a policy set.
We then expand the policy set with another policy which will be responsible for further learning.
arXiv Detail & Related papers (2023-02-02T08:25:12Z)
- Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline Meta-RL is emerging as a promising approach to address these challenges.
MerPO learns a meta-model for efficient task structure inference and an informative meta-policy.
We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z)
- MOORe: Model-based Offline-to-Online Reinforcement Learning [26.10368749930102]
We propose a model-based Offline-to-Online Reinforcement learning (MOORe) algorithm.
Experiment results show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaption.
arXiv Detail & Related papers (2022-01-25T03:14:57Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support [53.11601029040302]
Current offline-policy learning algorithms are mostly based on inverse propensity score (IPS) weighting.
We propose a novel approach that uses a hybrid of offline learning with online exploration.
Our approach determines an optimal policy with theoretical guarantees using the minimal number of online explorations.
arXiv Detail & Related papers (2021-07-24T05:07:43Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.