Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees
- URL: http://arxiv.org/abs/2311.08384v1
- Date: Tue, 14 Nov 2023 18:45:56 GMT
- Title: Offline Data Enhanced On-Policy Policy Gradient with Provable Guarantees
- Authors: Yifei Zhou, Ayush Sekhari, Yuda Song, Wen Sun
- Abstract summary: We propose a new hybrid RL algorithm that combines an on-policy actor-critic method with offline data.
Our approach integrates a procedure of off-policy training on the offline data into an on-policy NPG framework.
- Score: 23.838354396418868
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hybrid RL is the setting where an RL agent has access to both offline data
and online data by interacting with the real-world environment. In this work,
we propose a new hybrid RL algorithm that combines an on-policy actor-critic
method with offline data. On-policy methods such as policy gradient and natural
policy gradient (NPG) have been shown to be more robust to model misspecification,
though they may not be as sample efficient as methods that rely on
off-policy learning. On the other hand, offline methods that depend on
off-policy training often require strong assumptions in theory and are less
stable to train in practice. Our new approach integrates a procedure of
off-policy training on the offline data into an on-policy NPG framework. We
show that our approach, in theory, can obtain a best-of-both-worlds type of
result -- it achieves the state-of-the-art theoretical guarantees of offline RL
when offline RL-specific assumptions hold, while at the same time maintaining
the theoretical guarantees of on-policy NPG regardless of the offline RL
assumptions' validity. Experimentally, in challenging rich-observation
environments, we show that our approach outperforms a state-of-the-art hybrid
RL baseline which only relies on off-policy policy optimization, demonstrating
the empirical benefit of combining on-policy and off-policy learning. Our code
is publicly available at https://github.com/YifeiZhou02/HNPG.
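The following is a minimal sketch of the idea described above, assuming a toy tabular MDP with a softmax policy: the critic is evaluated off-policy on a mixture of offline data and freshly collected on-policy data, and the actor then takes a natural-policy-gradient step. The MDP, step sizes, and helper functions are illustrative assumptions, not the authors' HNPG implementation (see the repository above for the actual code).
```python
# Hedged sketch of a hybrid actor-critic loop: off-policy critic training on
# offline data combined with an on-policy NPG-style actor update.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # tiny synthetic MDP (assumption)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition probabilities P[s, a]
R = rng.uniform(0.0, 1.0, size=(S, A))       # rewards

def sample_transitions(policy, n):
    """Roll out `policy` (shape [S, A]) and return (s, a, r, s') tuples."""
    s = int(rng.integers(S))
    batch = []
    for _ in range(n):
        a = int(rng.choice(A, p=policy[s]))
        s2 = int(rng.choice(S, p=P[s, a]))
        batch.append((s, a, R[s, a], s2))
        s = s2
    return batch

def fit_q_off_policy(policy, data, iters=50):
    """Off-policy evaluation of `policy` by fitted TD on a replay of `data`."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        for s, a, r, s2 in data:
            target = r + gamma * policy[s2] @ Q[s2]   # expected-SARSA backup
            Q[s, a] += 0.1 * (target - Q[s, a])
    return Q

# Offline dataset collected by some (unknown) behavior policy.
behavior = np.full((S, A), 1.0 / A)
offline_data = sample_transitions(behavior, n=1000)

# Hybrid loop: the actor stays on-policy, the critic also reuses offline data.
theta = np.zeros((S, A))                     # softmax policy parameters
eta = 1.0                                    # NPG step size (assumption)
for _ in range(15):
    pi = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    online_data = sample_transitions(pi, n=200)            # on-policy samples
    Q = fit_q_off_policy(pi, offline_data + online_data)   # hybrid critic fit
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)           # advantage A^pi(s, a)
    theta += eta * adv                                       # tabular-softmax NPG step

print("estimated value of state 0 under the final policy:", float((pi[0] * Q[0]).sum()))
```
The point mirrored here is that the critic's TD backups can consume both the offline replay and fresh on-policy rollouts, while the actor update itself remains the on-policy NPG step.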
Related papers
- Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL [42.57662196581823]
Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks.
Most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer.
We present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy.
arXiv Detail & Related papers (2024-05-28T18:38:46Z)
- Reward-agnostic Fine-tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning [66.43003402281659]
A central question boils down to how to efficiently utilize online data collection to strengthen and complement the offline dataset.
We design a three-stage hybrid RL algorithm that beats the best of both worlds -- pure offline RL and pure online RL.
The proposed algorithm does not require any reward information during data collection.
arXiv Detail & Related papers (2023-05-17T15:17:23Z)
- Behavior Proximal Policy Optimization [14.701955559885615]
Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly.
Online on-policy algorithms are naturally able to solve offline RL.
We propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without any extra constraint or regularization.
arXiv Detail & Related papers (2023-02-22T11:49:12Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- A Policy-Guided Imitation Approach for Offline Reinforcement Learning [9.195775740684248]
We introduce Policy-guided Offline RL (POR).
POR demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL.
arXiv Detail & Related papers (2022-10-15T15:54:28Z)
- Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning a mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in the offline setting from the value function perspective.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods so that they are applied with rewards artificially penalized by the uncertainty of the dynamics (a rough sketch of such a penalty follows this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
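As a rough illustration of the uncertainty-penalized reward mentioned in the MOPO entry above (not MOPO's exact estimator), one common proxy for dynamics uncertainty is the disagreement of a learned model ensemble; the ensemble shape and penalty weight below are assumptions made for the sketch.
```python
# Hypothetical sketch of an uncertainty-penalized reward in the spirit of MOPO:
# r_tilde(s, a) = r_hat(s, a) - lam * u(s, a), using ensemble disagreement as a
# stand-in for the dynamics uncertainty u(s, a).
import numpy as np

def penalized_reward(r_hat: float, next_state_preds: np.ndarray, lam: float = 1.0) -> float:
    """next_state_preds: (n_models, state_dim) next-state predictions from an ensemble."""
    disagreement = float(np.linalg.norm(next_state_preds.std(axis=0)))  # uncertainty proxy
    return r_hat - lam * disagreement

# Example: three models that mostly agree, so the penalty is small.
preds = np.array([[0.10, 0.20], [0.12, 0.18], [0.11, 0.22]])
print(penalized_reward(r_hat=1.0, next_state_preds=preds, lam=0.5))
```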
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.