Unsupervised-to-Online Reinforcement Learning
- URL: http://arxiv.org/abs/2408.14785v1
- Date: Tue, 27 Aug 2024 05:23:45 GMT
- Title: Unsupervised-to-Online Reinforcement Learning
- Authors: Junsu Kim, Seohong Park, Sergey Levine,
- Abstract summary: Unsupervised-to-online RL (U2O RL) replaces domain-specific supervised offline RL with unsupervised offline RL.
U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations.
We empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches.
- Score: 59.910638327123394
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.
Related papers
- Active Advantage-Aligned Online Reinforcement Learning with Offline Data [56.98480620108727]
A3 RL is a novel method that actively selects data from combined online and offline sources to optimize policy improvement.
We provide theoretical guarantee that validates the effectiveness of our active sampling strategy.
arXiv Detail & Related papers (2025-02-11T20:31:59Z) - Efficient Online Reinforcement Learning Fine-Tuning Need Not Retain Offline Data [64.74333980417235]
We show that retaining offline data is unnecessary as long as we use a properly-designed online RL approach for fine-tuning offline RL.
We show that Warm-start RL (WSRL) is able to fine-tune without retaining any offline data, and is able to learn faster and attains higher performance than existing algorithms.
arXiv Detail & Related papers (2024-12-10T18:57:12Z) - Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy [12.589890916332196]
offline-to-online (O2O) RL can be leveraged to facilitate faster and safer online policy learning.
We introduce textbfMarvel, a novel framework for O2O safe RL, comprising two key components that work in concert.
Our work has the great potential to advance the field towards more efficient and practical safe RL solutions.
arXiv Detail & Related papers (2024-12-05T18:51:18Z) - Is Value Learning Really the Main Bottleneck in Offline RL? [70.54708989409409]
We show that the choice of a policy extraction algorithm significantly affects the performance and scalability of offline RL.
We propose two simple test-time policy improvement methods and show that these methods lead to better performance.
arXiv Detail & Related papers (2024-06-13T17:07:49Z) - ENOTO: Improving Offline-to-Online Reinforcement Learning with Q-Ensembles [52.34951901588738]
We propose a novel framework called ENsemble-based Offline-To-Online (ENOTO) RL.
By increasing the number of Q-networks, we seamlessly bridge offline pre-training and online fine-tuning without degrading performance.
Experimental results demonstrate that ENOTO can substantially improve the training stability, learning efficiency, and final performance of existing offline RL methods.
arXiv Detail & Related papers (2023-06-12T05:10:10Z) - Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning [93.99377042564919]
This paper tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages.
The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the "test bed" for offline policies.
We introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces.
arXiv Detail & Related papers (2023-05-24T15:45:35Z) - Using Offline Data to Speed Up Reinforcement Learning in Procedurally Generated Environments [16.62777710035937]
We study whether agents can leverage offline data in the form of trajectories to improve the sample-efficiency in procedurally generated environments.
We consider two settings of using IL from offline data for RL: (1) pre-training a policy before online RL training and (2) concurrently training a policy with online RL and IL from offline data.
arXiv Detail & Related papers (2023-04-18T16:23:15Z) - Contextual Transformer for Offline Meta Reinforcement Learning [16.587320914107128]
We show how prompts can improve sequence modeling-based offline reinforcement learning ( offline RL) algorithms.
We propose prompt tuning for offline RL, where a context vector sequence istextuald with the input to guide the conditional policy generation.
We extend our framework to Meta-RL settings and propose Contextual Meta Transformer (CMT); CMT leverages the context among different tasks as the prompt to improve generalization on unseen tasks.
arXiv Detail & Related papers (2022-11-15T10:00:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.