Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL
- URL: http://arxiv.org/abs/2405.18520v1
- Date: Tue, 28 May 2024 18:38:46 GMT
- Title: Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL
- Authors: Yu Luo, Tianying Ji, Fuchun Sun, Jianwei Zhang, Huazhe Xu, Xianyuan Zhan
- Abstract summary: Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks.
Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer.
We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
- Score: 42.57662196581823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.
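The mechanism in the abstract is concrete enough to sketch: an offline policy is trained concurrently on the shared replay buffer, its value estimate is compared with the online policy's, and on states where the offline policy looks better it serves as an adaptive constraint on the actor update. The snippet below is a minimal, illustrative rendering of that decision rule only; the Gaussian policy, the log-likelihood form of the constraint, and the `beta` weight are assumptions chosen here for concreteness, not the authors' implementation.

```python
# Illustrative sketch of OBAC-style adaptive constraining (not the paper's code).
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Toy Gaussian policy; any module returning a torch.distributions object works."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * act_dim))

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

class QNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def actor_loss(online_pi, offline_pi, q_net, obs, beta=1.0):
    """Plain off-policy actor loss, plus a pull toward the offline policy's action
    on exactly those states where the offline policy's Q-value is higher."""
    dist = online_pi(obs)
    a_online = dist.rsample()
    with torch.no_grad():
        a_offline = offline_pi(obs).sample()
        better = (q_net(obs, a_offline) > q_net(obs, a_online)).float()  # per-state switch
    base = -q_net(obs, a_online)                                   # maximize Q
    constraint = -dist.log_prob(a_offline).sum(-1, keepdim=True)   # imitate offline action
    return (base + beta * better * constraint).mean()

# Smoke test on random data.
obs = torch.randn(32, 8)
online_pi, offline_pi, q = GaussianPolicy(8, 2), GaussianPolicy(8, 2), QNetwork(8, 2)
actor_loss(online_pi, offline_pi, q, obs).backward()
```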
Related papers
- Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
A3 RL is a novel method that actively selects data from combined online and offline sources to optimize policy improvement.
We provide a theoretical guarantee that validates the effectiveness of our active sampling strategy; a rough illustrative sketch of this kind of data selection follows this entry.
arXiv Detail & Related papers (2025-02-11T20:31:59Z)
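Purely as an illustration of the kind of active data selection the A3 RL entry above describes (and not the paper's actual criterion, which we do not reproduce here), one could sample transitions from a merged online/offline buffer with probability increasing in an estimated advantage. Everything below, including the hypothetical `advantage_fn` and the softmax temperature, is an assumption made for the sketch.

```python
# Hypothetical advantage-weighted sampling from merged online/offline data.
import numpy as np

def sample_batch(online_buf, offline_buf, advantage_fn, batch_size=256, temp=1.0):
    """online_buf / offline_buf: sequences of transitions; advantage_fn: transition -> float."""
    merged = list(online_buf) + list(offline_buf)
    adv = np.array([advantage_fn(t) for t in merged], dtype=np.float64)
    # Softmax weighting: higher estimated advantage -> higher sampling probability.
    w = np.exp((adv - adv.max()) / temp)
    p = w / w.sum()
    idx = np.random.choice(len(merged), size=batch_size, p=p, replace=True)
    return [merged[i] for i in idx]
```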
- Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL).
We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z)
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
- Behavior Proximal Policy Optimization [14.701955559885615]
Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly.
Online on-policy algorithms are naturally able to solve offline RL.
We propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization; a rough sketch of the underlying idea follows this entry.
arXiv Detail & Related papers (2023-02-22T11:49:12Z)
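A rough sketch of the idea summarized in the Behavior Proximal Policy Optimization entry above, under the assumption that the method optimizes a PPO-style clipped surrogate with a behavior policy estimated from the dataset as the reference. How that reference policy is obtained and iterated is not shown; `policy` and `behavior_policy` are assumed to be modules returning `torch.distributions` objects (e.g., the Gaussian policy in the first sketch), and `adv` is a precomputed advantage estimate.

```python
# Illustrative BPPO-style objective: PPO clipping against an estimated behavior policy.
import torch

def clipped_surrogate_loss(policy, behavior_policy, obs, act, adv, eps=0.2):
    """Negative clipped surrogate; adv has shape (batch,)."""
    logp = policy(obs).log_prob(act).sum(-1)
    with torch.no_grad():
        logp_beh = behavior_policy(obs).log_prob(act).sum(-1)
    ratio = (logp - logp_beh).exp()                      # pi(a|s) / pi_beta(a|s)
    surrogate = torch.min(ratio * adv, ratio.clamp(1 - eps, 1 + eps) * adv)
    return -surrogate.mean()                             # minimize the negative surrogate
```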
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time; an illustrative resampling sketch follows this entry.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
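Since the ReD entry above stresses that the method is a small resampling-only change, here is a minimal sketch of return-based rebalancing: trajectories are resampled with probability increasing in episode return, which reweights the data while leaving its support unchanged. The exact weighting and the `oversample_factor` are assumptions for illustration, not the paper's recipe.

```python
# Minimal sketch of return-based data rebalancing (illustrative weighting only).
import numpy as np

def rebalance(trajectories, returns, oversample_factor=1.0):
    """trajectories: list of trajectories; returns: matching list of episode returns."""
    r = np.asarray(returns, dtype=np.float64)
    w = (r - r.min()) + 1e-6          # shift so all weights are positive
    p = w / w.sum()                   # higher return -> sampled more often
    n = int(len(trajectories) * (1.0 + oversample_factor))
    idx = np.random.choice(len(trajectories), size=n, p=p, replace=True)
    return [trajectories[i] for i in idx]
```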
- MOORe: Model-based Offline-to-Online Reinforcement Learning [26.10368749930102]
We propose a model-based Offline-to-Online Reinforcement Learning (MOORe) algorithm.
Experimental results show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaptation.
arXiv Detail & Related papers (2022-01-25T03:14:57Z)
- Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE).
MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary.
In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
- Representation Matters: Offline Pretraining for Sequential Decision Making [27.74988221252854]
In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making.
We find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms; a toy sketch of such pretraining follows this entry.
arXiv Detail & Related papers (2021-02-11T02:38:12Z)
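As a toy illustration of the pretraining idea in the entry above, the sketch below pretrains a state encoder with a simple reconstruction objective on offline observations before it is handed to a downstream policy learning algorithm. The paper studies several unsupervised objectives; the autoencoder here is just one assumed choice, and all sizes and hyperparameters are placeholders.

```python
# Toy unsupervised pretraining of a state encoder on offline observations.
import torch
import torch.nn as nn

def pretrain_encoder(obs_batches, obs_dim, latent_dim=32, epochs=10, lr=1e-3):
    """obs_batches: iterable of (batch, obs_dim) tensors of offline observations."""
    enc = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim))
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        for obs in obs_batches:
            loss = nn.functional.mse_loss(dec(enc(obs)), obs)  # reconstruction objective
            opt.zero_grad()
            loss.backward()
            opt.step()
    return enc  # reuse as the observation encoder for the policy / value networks
```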
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from a value-function perspective.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces; an illustrative pessimistic value estimate is sketched after this entry.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
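To make "learns a pessimistic value function" concrete, the sketch below shows one standard way to obtain a pessimistic estimate: evaluate an ensemble of critics and subtract a multiple of their disagreement. This illustrates pessimism in general and is not necessarily the construction used in POPO, which the entry above does not detail.

```python
# One generic way to form a pessimistic value estimate (illustrative, not POPO-specific).
import torch

def pessimistic_value(q_ensemble, obs, act, k=1.0):
    """q_ensemble: iterable of critics mapping (obs, act) -> (batch, 1) value tensors."""
    qs = torch.stack([q(obs, act) for q in q_ensemble], dim=0)  # (n_critics, batch, 1)
    return qs.mean(dim=0) - k * qs.std(dim=0)                   # penalize disagreement
```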