Decoupled Prioritized Resampling for Offline RL
- URL: http://arxiv.org/abs/2306.05412v3
- Date: Fri, 26 Jan 2024 15:03:21 GMT
- Title: Decoupled Prioritized Resampling for Offline RL
- Authors: Yang Yue, Bingyi Kang, Xiao Ma, Qisen Yang, Gao Huang, Shiji Song,
Shuicheng Yan
- Abstract summary: We propose Offline Prioritized Experience Replay (OPER) for offline reinforcement learning.
OPER features a class of priority functions designed to prioritize highly-rewarding transitions, making them more frequently visited during training.
We show that this class of priority functions induces an improved behavior policy; when constrained to this improved policy, a policy-constrained offline RL algorithm is likely to yield a better solution.
- Score: 120.49021589395005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) is challenged by the distributional shift
problem. To address this problem, existing works mainly focus on designing
sophisticated policy constraints between the learned policy and the behavior
policy. However, these constraints are applied equally to well-performing and
inferior actions through uniform sampling, which might negatively affect the
learned policy. To alleviate this issue, we propose Offline Prioritized
Experience Replay (OPER), featuring a class of priority functions designed to
prioritize highly-rewarding transitions, making them more frequently visited
during training. Through theoretical analysis, we show that this class of
priority functions induces an improved behavior policy, and when constrained to
this improved policy, a policy-constrained offline RL algorithm is likely to
yield a better solution. We develop two practical strategies to obtain priority
weights by estimating advantages based on a fitted value network (OPER-A) or
utilizing trajectory returns (OPER-R) for quick computation. OPER is a
plug-and-play component for offline RL algorithms. As case studies, we evaluate
OPER on five different algorithms, including BC, TD3+BC, Onestep RL, CQL, and
IQL. Extensive experiments demonstrate that both OPER-A and OPER-R
significantly improve the performance of all baseline methods. Code and
priority weights are available at https://github.com/sail-sg/OPER.
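Since OPER is described as a plug-and-play resampling component, the following is a minimal sketch of how trajectory-return-based priority weights (in the spirit of OPER-R) could be computed and used to draw prioritized minibatches. The softmax normalization, temperature, and helper names (`trajectory_return_weights`, `sample_batch`) are illustrative assumptions rather than the paper's exact formulation; see the official repository linked above for the actual implementation.

```python
import numpy as np

def trajectory_return_weights(returns, temperature=1.0, eps=1e-6):
    """Map per-trajectory returns to positive sampling weights.

    NOTE: a softmax over standardized returns is one plausible choice;
    the paper's exact priority function may differ (illustrative sketch).
    """
    r = np.asarray(returns, dtype=np.float64)
    r = (r - r.mean()) / (r.std() + eps)   # standardize returns
    w = np.exp(r / temperature)            # higher return -> larger weight
    return w / w.sum()

def sample_batch(transitions, traj_ids, traj_weights, batch_size, rng):
    """Draw a minibatch where each transition is sampled with probability
    proportional to the weight of the trajectory it belongs to."""
    traj_ids = np.asarray(traj_ids)
    p = traj_weights[traj_ids]             # broadcast trajectory weight to its transitions
    p = p / p.sum()
    idx = rng.choice(len(transitions), size=batch_size, p=p)
    return [transitions[i] for i in idx]

# Usage sketch: `transitions` is a flat list of (s, a, r, s') tuples and
# traj_ids[i] is the index of the trajectory transition i came from.
rng = np.random.default_rng(0)
traj_returns = [10.0, 3.5, 7.2]            # one return per trajectory
weights = trajectory_return_weights(traj_returns)
# batch = sample_batch(transitions, traj_ids, weights, batch_size=256, rng=rng)
```

Under a scheme like this, transitions from high-return trajectories are visited more often while the sampling distribution stays supported on the dataset, which is what lets such resampling be wrapped around existing policy-constrained offline RL algorithms.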
Related papers
- Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z)
- Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior-constrained policy optimization has been demonstrated to be a successful paradigm for tackling offline reinforcement learning.
In this paper, we propose our closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Regret Minimization Experience Replay [14.233842517210437]
Prioritized sampling is a promising technique for improving the performance of RL agents.
In this work, we theoretically analyze the optimal prioritization strategy that minimizes the regret of the RL policy.
We propose two practical algorithms, RM-DisCor and RM-TCE.
arXiv Detail & Related papers (2021-05-15T16:08:45Z)