Accelerating Exploration with Unlabeled Prior Data
- URL: http://arxiv.org/abs/2311.05067v2
- Date: Tue, 21 Nov 2023 00:46:47 GMT
- Title: Accelerating Exploration with Unlabeled Prior Data
- Authors: Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine
- Abstract summary: We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to solve tasks from a sparse reward signal is a major challenge for
standard reinforcement learning (RL) algorithms. However, in the real world,
agents rarely need to solve sparse reward tasks entirely from scratch. More
often, we might possess prior experience to draw on that provides considerable
guidance about which actions and outcomes are possible in the world, which we
can use to explore more effectively for new tasks. In this work, we study how
prior data without reward labels may be used to guide and accelerate
exploration for an agent solving a new sparse reward task. We propose a simple
approach that learns a reward model from online experience, labels the
unlabeled prior data with optimistic rewards, and then uses it concurrently
alongside the online data for downstream policy and critic optimization. This
general formula leads to rapid exploration in several challenging sparse-reward
domains where tabula rasa exploration is insufficient, including the AntMaze
domain, Adroit hand manipulation domain, and a visual simulated robotic
manipulation domain. Our results highlight the ease of incorporating unlabeled
prior data into existing online RL algorithms, and the (perhaps surprising)
effectiveness of doing so.
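The approach described above amounts to a small change in the data pipeline of an off-policy RL agent. The sketch below is a minimal illustration of that recipe, assuming a NumPy-based setup; the names (RewardEnsemble, sample_mixed_batch) and the ensemble-disagreement optimism bonus are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Minimal sketch of the recipe from the abstract: learn a reward model from
# online experience, label unlabeled prior transitions with optimistic rewards,
# and mix both data sources in every policy/critic update. All class and
# function names here are illustrative assumptions, not the paper's code.

class RewardEnsemble:
    """Ensemble of linear reward models fit to online (features, reward) pairs."""

    def __init__(self, feat_dim, n_members=5, lr=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(n_members, feat_dim))
        self.lr = lr

    def fit_step(self, feats, rewards):
        # One step of squared-error regression per ensemble member.
        for i in range(len(self.w)):
            pred = feats @ self.w[i]
            self.w[i] -= self.lr * feats.T @ (pred - rewards) / len(rewards)

    def optimistic_reward(self, feats, beta=1.0):
        # Optimism via an ensemble-disagreement bonus (an assumed design choice).
        preds = feats @ self.w.T          # shape: (batch, n_members)
        return preds.mean(axis=1) + beta * preds.std(axis=1)


def sample_mixed_batch(prior_feats, online_feats, online_rewards,
                       reward_model, batch_size=256, online_frac=0.5, rng=None):
    """Mix optimistically relabeled prior data with reward-labeled online data."""
    rng = rng if rng is not None else np.random.default_rng()
    n_online = int(batch_size * online_frac)
    n_prior = batch_size - n_online
    p_idx = rng.integers(len(prior_feats), size=n_prior)
    o_idx = rng.integers(len(online_feats), size=n_online)
    feats = np.concatenate([prior_feats[p_idx], online_feats[o_idx]])
    rewards = np.concatenate([reward_model.optimistic_reward(prior_feats[p_idx]),
                              online_rewards[o_idx]])
    return feats, rewards  # hand off to the base agent's critic/policy update
```

The mixed batch returned here would be fed to whatever policy and critic updates the base agent already uses; the only new moving parts are the online-trained reward model and the optimism bonus applied when relabeling prior transitions.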
Related papers
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- Semi-supervised reward learning for offline reinforcement learning
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning
We show that we can reuse prior data to extend new skills simply through dynamic programming.
We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task.
We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands.
arXiv Detail & Related papers (2020-10-27T17:57:29Z)
- Learning Guidance Rewards with Trajectory-space Smoothing
Long-term temporal credit assignment is an important challenge in deep reinforcement learning.
Existing policy-gradient and Q-learning algorithms rely on dense environmental rewards that provide rich short-term supervision.
Recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards.
arXiv Detail & Related papers (2020-10-23T23:55:06Z)
- Learning Dexterous Manipulation from Suboptimal Experts
Relative Entropy Q-Learning (REQ) is a simple policy iteration algorithm that combines ideas from successful offline and conventional RL algorithms.
We show how REQ is also effective for general off-policy RL, offline RL, and RL from demonstrations.
arXiv Detail & Related papers (2020-10-16T18:48:49Z)
- Generalized Hindsight for Reinforcement Learning
We argue that low-reward data collected while trying to solve one task provides little to no signal for solving that particular task, yet it can still serve as useful supervision for other tasks.
We present Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks.
arXiv Detail & Related papers (2020-02-26T18:57:05Z)