Related papers: Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

URL: http://arxiv.org/abs/2405.18520v1
Date: Tue, 28 May 2024 18:38:46 GMT
Title: Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL
Authors: Yu Luo, Tianying Ji, Fuchun Sun, Jianwei Zhang, Huazhe Xu, Xianyuan Zhan
Abstract summary: Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks. Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer. We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
Score: 42.57662196581823
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.
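The abstract describes the mechanism only at a high level: compare the value of the online policy with that of the concurrently trained offline policy, and constrain the online policy toward the offline one only where the offline policy looks better. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The function names, the hard 0/1 gating, the `coef` parameter, and the form of the penalty term are all assumptions made for illustration.

```python
from typing import List, Sequence


def adaptive_constraint_weights(v_online: Sequence[float],
                                v_offline: Sequence[float]) -> List[float]:
    """Per-state gate: 1.0 where the offline value estimate is strictly higher.

    Hypothetical helper; a hard 0/1 gate is an assumption of this sketch.
    """
    if len(v_online) != len(v_offline):
        raise ValueError("value batches must have the same length")
    return [1.0 if v_off > v_on else 0.0
            for v_on, v_off in zip(v_online, v_offline)]


def constrained_actor_loss(q_online: Sequence[float],
                           penalty: Sequence[float],
                           weights: Sequence[float],
                           coef: float = 1.0) -> float:
    """Batch mean of -Q(s, a) + coef * gate(s) * penalty(s).

    `penalty` stands in for any distance between the online action and the
    offline policy's action (assumed, e.g. a squared error); it only
    contributes on states where the gate is 1.0.
    """
    if not (len(q_online) == len(penalty) == len(weights)):
        raise ValueError("all batches must have the same length")
    if len(q_online) == 0:
        raise ValueError("batch must not be empty")
    total = 0.0
    for q, pen, w in zip(q_online, penalty, weights):
        total += -q + coef * w * pen
    return total / len(q_online)


# Example: the offline policy is judged better only on the first state,
# so only that state's penalty enters the loss.
gate = adaptive_constraint_weights([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])
loss = constrained_actor_loss([1.0, 2.0, 3.0], [0.5, 0.5, 0.5], gate, coef=2.0)
```

In this sketch the constraint switches off entirely wherever the online policy is at least as good, which is one simple reading of "adaptive constraint"; a soft weighting would be an equally plausible choice.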

Related papers

Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
A3 RL is a novel method that actively selects data from combined online and offline sources to optimize policy improvement. We provide a theoretical guarantee that validates the effectiveness of our active sampling strategy.
arXiv Detail & Related papers (2025-02-11T20:31:59Z)
Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone [72.17534881026995]
We develop an offline and online fine-tuning approach called policy-agnostic RL (PA-RL). We show the first result that successfully fine-tunes OpenVLA, a 7B generalist robot policy, autonomously with Cal-QL, an online RL fine-tuning algorithm.
arXiv Detail & Related papers (2024-12-09T17:28:03Z)
Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL. We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees [23.838354396418868]
We propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data. Our approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework.
arXiv Detail & Related papers (2023-11-14T18:45:56Z)
Behavior Proximal Policy Optimization [14.701955559885615]
Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly. Online on-policy algorithms are naturally able to solve offline RL. We propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization.
arXiv Detail & Related papers (2023-02-22T11:49:12Z)
Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets. We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged. We dub our method ReD (Return-based Data Rebalance), which can be implemented with fewer than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
Model-Based Offline Meta-Reinforcement Learning with Regularization [63.35040401948943]
Offline Meta-RL is emerging as a promising approach to address these challenges. MerPO learns a meta-model for efficient task structure inference and an informative meta-policy. We show that MerPO offers guaranteed improvement over both the behavior policy and the meta-policy.
arXiv Detail & Related papers (2022-02-07T04:15:20Z)
MOORe: Model-based Offline-to-Online Reinforcement Learning [26.10368749930102]
We propose a model-based Offline-to-Online Reinforcement learning (MOORe) algorithm. Experiment results show that our algorithm smoothly transfers from offline to online stages while enabling sample-efficient online adaption.
arXiv Detail & Related papers (2022-01-25T03:14:57Z)
Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment. We propose Curriculum Offline Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return. On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids learning merely mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL [82.93243616342275]
We introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE). MABE is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary. In experiments that require cross-domain generalization, we find that MABE outperforms prior methods.
arXiv Detail & Related papers (2021-06-16T20:48:49Z)
Representation Matters: Offline Pretraining for Sequential Decision Making [27.74988221252854]
In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making. We find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms.
arXiv Detail & Related papers (2021-02-11T02:38:12Z)
POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value function view. We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy. We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.