Efficient Sparse-Reward Goal-Conditioned Reinforcement Learning with a
High Replay Ratio and Regularization
- URL: http://arxiv.org/abs/2312.05787v1
- Date: Sun, 10 Dec 2023 06:30:19 GMT
- Title: Efficient Sparse-Reward Goal-Conditioned Reinforcement Learning with a
High Replay Ratio and Regularization
- Authors: Takuya Hiraoka
- Abstract summary: Reinforcement learning (RL) methods with a high replay ratio (RR) and regularization have gained interest due to their superior sample efficiency.
In this paper, we aim to extend these RL methods to sparse-reward goal-conditioned tasks.
- Score: 1.57731592348751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) methods with a high replay ratio (RR) and
regularization have gained interest due to their superior sample efficiency.
However, these methods have mainly been developed for dense-reward tasks. In
this paper, we aim to extend these RL methods to sparse-reward goal-conditioned
tasks. We use Randomized Ensemble Double Q-learning (REDQ) (Chen et al., 2021),
an RL method with a high RR and regularization. To apply REDQ to sparse-reward
goal-conditioned tasks, we make the following modifications to it: (i) using
hindsight experience replay and (ii) bounding target Q-values. We evaluate REDQ
with these modifications on 12 sparse-reward goal-conditioned tasks of Robotics
(Plappert et al., 2018), and show that it achieves about $2 \times$ better
sample efficiency than previous state-of-the-art (SoTA) RL methods.
Furthermore, we reconsider the necessity of specific components of REDQ and
simplify it by removing unnecessary ones. The simplified REDQ with our
modifications achieves $\sim 8 \times$ better sample efficiency than the SoTA
methods in 4 Fetch tasks of Robotics.
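To make the two modifications concrete, the sketch below is a minimal, hypothetical Python illustration rather than the authors' implementation. It assumes the Fetch-style sparse reward of -1 per step until the goal is reached (so returns lie in [-1/(1-gamma), 0]), HER's "future" relabelling strategy with k substitute goals per transition, and an ensemble of critics whose minimum forms the TD target, in the spirit of REDQ; the constants and helper names are illustrative only.

```python
import numpy as np

GAMMA = 0.98                                 # discount factor (assumed, not stated above)
Q_MIN, Q_MAX = -1.0 / (1.0 - GAMMA), 0.0     # value range implied by per-step rewards in {-1, 0}


def her_relabel(episode, reward_fn, k=4, rng=np.random.default_rng()):
    """Modification (i): hindsight experience replay with the "future" strategy.

    `episode` is a list of dicts with keys obs, action, next_obs, achieved_goal,
    desired_goal; `reward_fn(achieved, desired)` recomputes the sparse reward.
    """
    out = []
    for t, tr in enumerate(episode):
        # keep the original transition with its true goal
        out.append({**tr, "reward": reward_fn(tr["achieved_goal"], tr["desired_goal"])})
        # add k copies relabelled with goals actually achieved later in the episode
        for i in rng.integers(t, len(episode), size=k):
            g = episode[i]["achieved_goal"]
            out.append({**tr, "desired_goal": g,
                        "reward": reward_fn(tr["achieved_goal"], g)})
    return out


def bounded_target(reward, done, next_q_ensemble):
    """Modification (ii): clip the TD target to the feasible value range.

    `next_q_ensemble` has shape (n_critics, batch) and holds Q_i(s', a') with
    a' drawn from the policy; the minimum over critics is a REDQ-style
    pessimistic estimate (taken over the full ensemble here for simplicity).
    """
    next_q = next_q_ensemble.min(axis=0)
    target = reward + GAMMA * (1.0 - done) * next_q
    return np.clip(target, Q_MIN, Q_MAX)
```

Clipping to [Q_MIN, Q_MAX] is one natural reading of "bounding target Q-values": with rewards in {-1, 0}, no return can fall outside that interval, so any target that does is an estimation artifact.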
Related papers
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach that computes unbiased Monte Carlo-based value estimates for credit assignment.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z)
- Adaptive $Q$-Network: On-the-fly Target Selection for Deep Reinforcement Learning [18.579378919155864]
We propose Adaptive $Q$-Network (AdaQN) to take into account the non-stationarity of the optimization procedure without requiring additional samples.
AdaQN is theoretically sound, and we empirically validate it on MuJoCo control problems and Atari 2600 games.
arXiv Detail & Related papers (2024-05-25T11:57:43Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- Train Hard, Fight Easy: Robust Meta Reinforcement Learning [78.16589993684698]
A major challenge of reinforcement learning (RL) in real-world applications is the variation between environments, tasks or clients.
Standard meta reinforcement learning (MRL) methods optimize the average return over tasks, but often suffer from poor results in tasks of high risk or difficulty.
In this work, we define a robust MRL objective with a controlled robustness level.
The resulting data inefficiency is addressed via the novel Robust Meta RL algorithm (RoML).
arXiv Detail & Related papers (2023-01-26T14:54:39Z)
- Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
arXiv Detail & Related papers (2023-01-05T23:14:38Z)
- Learning Progress Driven Multi-Agent Curriculum [18.239527837186216]
Curriculum reinforcement learning aims to speed up learning by gradually increasing the difficulty of a task.
We propose self-paced MARL (SPMARL) to prioritize tasks based on learning progress instead of the episode return.
arXiv Detail & Related papers (2022-05-20T08:16:30Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
- Randomized Ensembled Double Q-Learning: Learning Fast Without a Model [8.04816643418952]
We introduce a simple model-free algorithm, Randomized Ensembled Double Q-Learning (REDQ); a sketch of its ensemble-minimum target is given after this list.
We show that REDQ's performance is just as good as, if not better than, a state-of-the-art model-based algorithm for the MuJoCo benchmark.
arXiv Detail & Related papers (2021-01-15T06:25:58Z)
- Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples [31.31937554018045]
Deep reinforcement learning (RL) methods require intensive data from the exploration of the environment to achieve satisfactory performance.
We propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations.
Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm.
arXiv Detail & Related papers (2020-06-28T21:13:08Z)
- CrossQ: Batch Normalization in Deep Reinforcement Learning for Greater Sample Efficiency and Simplicity [34.36803740112609]
CrossQ matches or surpasses current state-of-the-art methods in terms of sample efficiency.
It substantially reduces the computational cost compared to REDQ and DroQ.
It is easy to implement, requiring just a few lines of code on top of SAC.
arXiv Detail & Related papers (2019-02-14T21:05:50Z)
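Since the paper above builds directly on REDQ, here is a brief sketch of its in-target minimization as described by Chen et al. (2021). The ensemble size N=10, subset size M=2, and update-to-data ratio G=20 are the commonly cited defaults rather than values taken from this page, and the SAC entropy term is omitted for brevity.

```python
import numpy as np

N_CRITICS, SUBSET_M, UTD_G, GAMMA = 10, 2, 20, 0.99   # typical REDQ defaults (assumed)


def redq_target(reward, done, next_q_all, rng=np.random.default_rng()):
    """TD target using the minimum over a random subset of the critic ensemble.

    `next_q_all` has shape (N_CRITICS, batch) and holds Q_i(s', a') with a'
    sampled from the current policy.
    """
    idx = rng.choice(N_CRITICS, size=SUBSET_M, replace=False)
    min_q = next_q_all[idx].min(axis=0)            # pessimism over the sampled subset only
    return reward + GAMMA * (1.0 - done) * min_q   # shared target for all N critics
```

All N critics regress toward this shared target and the actor is updated against the ensemble mean; the high replay ratio refers to performing G such gradient updates per environment step.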
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.