Policy Expansion for Bridging Offline-to-Online Reinforcement Learning
- URL: http://arxiv.org/abs/2302.00935v3
- Date: Sat, 15 Apr 2023 20:34:57 GMT
- Title: Policy Expansion for Bridging Offline-to-Online Reinforcement Learning
- Authors: Haichao Zhang, Wei Xu, Haonan Yu
- Abstract summary: In this work, we introduce a policy expansion scheme for this task.
After learning the offline policy, we use it as one candidate policy in a policy set.
We then expand the policy set with another policy which will be responsible for further learning.
- Score: 20.24902196844508
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training with offline data and online fine-tuning using reinforcement
learning is a promising strategy for learning control policies by leveraging
the best of both worlds in terms of sample efficiency and performance. One
natural approach is to initialize the policy for online learning with the one
trained offline. In this work, we introduce a policy expansion scheme for this
task. After learning the offline policy, we use it as one candidate policy in a
policy set. We then expand the policy set with another policy which will be
responsible for further learning. The two policies will be composed in an
adaptive manner for interacting with the environment. With this approach, the
policy previously learned offline is fully retained during online learning,
thus mitigating potential issues such as destroying the useful behaviors of
the offline policy in the initial stage of online learning, while allowing the
offline policy to participate naturally in exploration in an adaptive manner.
Moreover, new useful behaviors can potentially be captured by the newly added
policy through learning. Experiments are conducted on a number of tasks and the
results demonstrate the effectiveness of the proposed approach.
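The adaptive composition described in the abstract can be sketched concretely. Below is a minimal illustrative sketch, not the authors' implementation: the frozen offline policy and the newly added online policy each propose an action, and one proposal is sampled with probability proportional to a softmax over the critic's values, so the offline policy keeps acting wherever it still looks competitive. The function names (compose_and_act, q_fn), the temperature, and the toy policies/critic are assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of adaptively composing a frozen offline
# policy with a newly added learnable policy via critic-weighted sampling.
import numpy as np


def compose_and_act(obs, offline_policy, online_policy, q_fn, temperature=1.0, rng=None):
    """Pick an action from the expanded policy set {offline, online}.

    Each candidate policy proposes an action; the proposals are weighted by a
    softmax over their critic values, so the offline policy keeps contributing
    where it is still competitive while the new policy can take over elsewhere.
    """
    rng = rng or np.random.default_rng()
    proposals = [offline_policy(obs), online_policy(obs)]    # candidate actions
    q_values = np.array([q_fn(obs, a) for a in proposals])   # critic scores
    logits = q_values / temperature
    probs = np.exp(logits - logits.max())                    # numerically stable softmax
    probs /= probs.sum()
    idx = rng.choice(len(proposals), p=probs)                # adaptive selection
    return proposals[idx], idx                               # chosen action and which policy acted


if __name__ == "__main__":
    # Toy stand-ins: scalar observation/action policies and a quadratic critic.
    offline_pi = lambda obs: float(np.clip(0.5 * obs, -1.0, 1.0))
    online_pi = lambda obs: float(np.tanh(obs))
    q = lambda obs, a: -(a - 0.3) ** 2                       # prefers actions near 0.3
    action, chosen = compose_and_act(1.2, offline_pi, online_pi, q)
    print(f"action={action:.3f}, selected policy index={int(chosen)}")
```

In an online fine-tuning loop, only the newly added policy (and the critic) would be updated from the collected transitions, while the offline policy stays frozen, which is what keeps its useful behaviors intact.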
Related papers
- A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning [2.823645435281551]
Offline-to-online reinforcement learning (RL) uses both pre-trained offline policies and online policies trained for downstream tasks.
In this study, we propose an innovative offline-to-online RL method that employs a non-monolithic exploration approach.
arXiv Detail & Related papers (2024-10-31T08:49:37Z)
- Bayesian Design Principles for Offline-to-Online Reinforcement Learning [50.97583504192167]
Offline-to-online fine-tuning is crucial for real-world applications where exploration can be costly or unsafe.
In this paper, we tackle the dilemma of offline-to-online fine-tuning: if the agent remains pessimistic, it may fail to learn a better policy, while if it becomes optimistic directly, performance may suffer from a sudden drop.
We show that Bayesian design principles are crucial in solving such a dilemma.
arXiv Detail & Related papers (2024-05-31T16:31:07Z)
- Uni-O4: Unifying Online and Offline Deep Reinforcement Learning with Multi-Step On-Policy Optimization [24.969834057981046]
Previous approaches treat offline and online learning as separate procedures, resulting in redundant designs and limited performance.
We propose Uni-o4, which utilizes an on-policy objective for both offline and online learning.
We demonstrate that our method achieves state-of-the-art performance in both offline learning and offline-to-online fine-tuning.
arXiv Detail & Related papers (2023-11-06T18:58:59Z)
- Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z)
- IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimations of source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
- Residual Q-Learning: Offline and Online Policy Customization without Value [53.47311900133564]
Imitation Learning (IL) is a widely used framework for learning imitative behavior from demonstrations.
We formulate a new problem setting called policy customization.
We propose a novel framework, Residual Q-learning, which can solve the formulated MDP by leveraging the prior policy.
arXiv Detail & Related papers (2023-06-15T22:01:19Z)
- Safe Evaluation For Offline Learning: Are We Ready To Deploy? [47.331520779610535]
We introduce a framework for safe evaluation of offline learning using approximate high-confidence off-policy evaluation.
A lower-bound estimate tells us how well a newly learned target policy would perform before it is deployed in the real environment.
arXiv Detail & Related papers (2022-12-16T06:43:16Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Behavior Constraining in Weight Space for Offline Reinforcement Learning [2.7184068098378855]
In offline reinforcement learning, a policy needs to be learned from a single dataset.
We propose a new algorithm, which constrains the policy directly in its weight space instead, and demonstrate its effectiveness in experiments.
arXiv Detail & Related papers (2021-07-12T14:50:50Z)
- Non-Stationary Off-Policy Optimization [50.41335279896062]
We study the novel problem of off-policy optimization in piecewise-stationary contextual bandits.
In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state.
In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance.
arXiv Detail & Related papers (2020-06-15T09:16:09Z)