Stitching Sub-Trajectories with Conditional Diffusion Model for
Goal-Conditioned Offline RL
- URL: http://arxiv.org/abs/2402.07226v1
- Date: Sun, 11 Feb 2024 15:23:13 GMT
- Title: Stitching Sub-Trajectories with Conditional Diffusion Model for
Goal-Conditioned Offline RL
- Authors: Sungyoon Kim, Yunseon Choi, Daiki E. Matsunaga, and Kee-Eung Kim
- Abstract summary: We propose a model-based offline Goal-Conditioned Reinforcement Learning (Offline GCRL) method to acquire diverse goal-oriented skills.
In this paper, we use a diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset.
We report state-of-the-art performance in the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch the segments of suboptimal trajectories in the offline data to generate high-quality plans.
- Score: 18.31263353823447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline Goal-Conditioned Reinforcement Learning (Offline GCRL) is an
important problem in RL that focuses on acquiring diverse goal-oriented skills
solely from pre-collected behavior datasets. In this setting, the reward
feedback is typically absent except when the goal is achieved, which makes it
difficult to learn policies, especially from a finite dataset of suboptimal
behaviors. In addition, realistic scenarios involve long-horizon planning,
which necessitates the extraction of useful skills within sub-trajectories.
Recently, conditional diffusion models have been shown to be a promising
approach for generating high-quality long-horizon plans in RL. However, their
practicality in the goal-conditioned setting is still limited due to a number
of technical assumptions made by these methods. In this paper, we propose SSD
(Sub-trajectory Stitching with Diffusion), a model-based offline GCRL method
that leverages the conditional diffusion model to address these limitations. In
summary, we use a diffusion model that generates future plans conditioned on
the target goal and value, with the target value estimated from the
goal-relabeled offline dataset. We report state-of-the-art performance in the
standard benchmark set of GCRL tasks, and demonstrate the capability to
successfully stitch the segments of suboptimal trajectories in the offline data
to generate high-quality plans.
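As a rough illustration of the approach described in the abstract, the sketch below samples a sub-trajectory plan from a denoising diffusion model conditioned on a target goal and a target value. The horizon, dimensions, network, and noise schedule are placeholder assumptions for illustration, not the architecture or hyperparameters used in the paper.

```python
# Minimal sketch (not the authors' code): DDPM-style sampling of a length-H plan
# conditioned on (goal, value). All sizes below are assumed for illustration.
import torch
import torch.nn as nn

H, STATE_DIM, GOAL_DIM = 16, 4, 2   # plan horizon and dimensions (assumed)
T = 50                              # number of diffusion steps (assumed)

class GoalValueDenoiser(nn.Module):
    """Predicts the noise in a noisy plan, given the diffusion step,
    the target goal, and the estimated target value."""
    def __init__(self):
        super().__init__()
        in_dim = H * STATE_DIM + 1 + GOAL_DIM + 1  # plan + step + goal + value
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, H * STATE_DIM),
        )

    def forward(self, noisy_plan, t, goal, value):
        x = torch.cat([noisy_plan.flatten(1), t, goal, value], dim=-1)
        return self.net(x).view(-1, H, STATE_DIM)

# Standard DDPM noise schedule.
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_plan(denoiser, goal, value):
    """Reverse diffusion: start from Gaussian noise and iteratively denoise
    into a length-H sub-trajectory conditioned on (goal, value)."""
    plan = torch.randn(1, H, STATE_DIM)
    for t in reversed(range(T)):
        t_in = torch.full((1, 1), t / T)
        eps = denoiser(plan, t_in, goal, value)
        a, ab = alphas[t], alpha_bars[t]
        mean = (plan - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        noise = torch.randn_like(plan) if t > 0 else torch.zeros_like(plan)
        plan = mean + torch.sqrt(betas[t]) * noise
    return plan  # imagined sub-trajectory toward the goal

plan = sample_plan(GoalValueDenoiser(), goal=torch.zeros(1, GOAL_DIM),
                   value=torch.ones(1, 1))
```

In the paper's setting, the target value would be estimated from the goal-relabeled offline dataset and supplied as the conditioning input above; the sketch leaves that estimator out.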
Related papers
- Directed Exploration in Reinforcement Learning from Linear Temporal Logic [59.707408697394534]
Linear temporal logic (LTL) is a powerful language for task specification in reinforcement learning.
We show that the synthesized reward signal remains fundamentally sparse, making exploration challenging.
We show how better exploration can be achieved by further leveraging the specification and casting its corresponding Limit Deterministic Büchi Automaton (LDBA) as a Markov reward process.
arXiv Detail & Related papers (2024-08-18T14:25:44Z)
- A Tractable Inference Perspective of Offline RL [36.563229330549284]
A popular paradigm for offline Reinforcement Learning (RL) tasks is to first fit the offline trajectories to a sequence model, and then prompt the model for actions that lead to high expected return.
This paper highlights that tractability, the ability to exactly and efficiently answer various probabilistic queries, plays an important role in offline RL.
We propose Trifle, which bridges the gap between good sequence models and high expected returns at evaluation time.
arXiv Detail & Related papers (2023-10-31T19:16:07Z)
- GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models [31.628341050846768]
Goal-conditioned Offline Planning (GOPlan) is a novel model-based framework that contains two key phases.
GOPlan pretrains a prior policy capable of capturing the multi-modal action distributions within the multi-goal dataset.
The reanalysis method generates high-quality imaginary data by planning with learned models for both intra-trajectory and inter-trajectory goals.
arXiv Detail & Related papers (2023-10-30T21:19:52Z)
- HIQL: Offline Goal-Conditioned RL with Latent States as Actions [81.67963770528753]
We propose a hierarchical algorithm for goal-conditioned RL from offline data.
We show how this hierarchical decomposition makes our method robust to noise in the estimated value function.
Our method can solve long-horizon tasks that stymie prior methods, can scale to high-dimensional image observations, and can readily make use of action-free data (a minimal sketch of this hierarchical split appears after this list).
arXiv Detail & Related papers (2023-07-22T00:17:36Z)
- Reparameterized Policy Learning for Multimodal Trajectory Optimization [61.13228961771765]
We investigate the challenge of parametrizing policies for reinforcement learning in high-dimensional continuous action spaces.
We propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories.
We present a practical model-based RL method, which leverages the multimodal policy parameterization and learned world model.
arXiv Detail & Related papers (2023-07-20T09:05:46Z)
- Goal-Conditioned Predictive Coding for Offline Reinforcement Learning [24.300131097275298]
We investigate whether sequence modeling has the ability to condense trajectories into useful representations that enhance policy learning.
We introduce Goal-Conditioned Predictive Coding, a sequence modeling objective that yields powerful trajectory representations and leads to performant policies.
arXiv Detail & Related papers (2023-07-07T06:12:14Z)
- Imitating Graph-Based Planning with Goal-Conditioned Policies [72.61631088613048]
We present a self-imitation scheme which distills a subgoal-conditioned policy into the target-goal-conditioned policy.
We empirically show that our method can significantly boost the sample-efficiency of the existing goal-conditioned RL methods.
arXiv Detail & Related papers (2023-03-20T14:51:10Z)
- Swapped goal-conditioned offline reinforcement learning [8.284193221280216]
We present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG).
In the experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods in a wide range of benchmark tasks.
arXiv Detail & Related papers (2023-02-17T13:22:40Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective which jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
- Model-Based Offline Planning with Trajectory Pruning [15.841609263723575]
Offline reinforcement learning (RL) enables learning policies using pre-collected datasets without environment interaction.
We propose a new lightweight model-based offline planning framework, namely MOPP, which tackles the dilemma between the restrictions of offline learning and high-performance planning.
Experimental results show that MOPP provides competitive performance compared with existing model-based offline planning and RL approaches.
arXiv Detail & Related papers (2021-05-16T05:00:54Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by running them on rewards artificially penalized by the uncertainty of the learned dynamics (a minimal sketch of this penalty appears after this list).
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
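For the HIQL entry above, the sketch below illustrates the kind of two-level decomposition its summary describes: a high-level policy maps (state, goal) to a latent subgoal representation, and a low-level policy maps (state, subgoal) to an action. Dimensions and networks are placeholders; this is only an illustration of the general structure, not the paper's implementation.

```python
# Minimal sketch of a hierarchical goal-conditioned policy: a high-level policy
# proposes a latent subgoal and a low-level policy acts toward it.
import torch
import torch.nn as nn

STATE_DIM, GOAL_DIM, LATENT_DIM, ACTION_DIM = 8, 8, 16, 2  # assumed sizes

class HighLevelPolicy(nn.Module):
    """(state, final goal) -> latent subgoal representation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + GOAL_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))

    def forward(self, s, g):
        return self.net(torch.cat([s, g], dim=-1))

class LowLevelPolicy(nn.Module):
    """(state, latent subgoal) -> action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM + LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, ACTION_DIM))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

def act(high, low, state, goal):
    # Splitting the long horizon in two: the high level only has to point in
    # roughly the right direction, which is less sensitive to value noise.
    z = high(state, goal)
    return low(state, z)

action = act(HighLevelPolicy(), LowLevelPolicy(),
             torch.zeros(1, STATE_DIM), torch.ones(1, GOAL_DIM))
```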
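For the MOPO entry above, the sketch below shows the reward penalty its summary describes: the model-predicted reward is reduced by an uncertainty estimate, here taken as the largest predicted next-state standard deviation across a dynamics ensemble. The ensemble here is a random stand-in for illustration; in practice it would be a learned probabilistic ensemble, and the penalty coefficient is a tuned hyperparameter.

```python
# Minimal sketch of an uncertainty-penalized reward: r_tilde = r - lambda * u(s, a).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM, N_MODELS = 3, 2, 5
PENALTY_COEF = 1.0  # lambda; tuned in practice

# Stand-in ensemble: each member "predicts" a next-state mean and std.
weights = [rng.normal(size=(STATE_DIM + ACTION_DIM, STATE_DIM)) for _ in range(N_MODELS)]
log_stds = [rng.normal(scale=0.1, size=STATE_DIM) for _ in range(N_MODELS)]

def ensemble_predict(state, action):
    x = np.concatenate([state, action])
    means = np.stack([x @ w for w in weights])  # (N_MODELS, STATE_DIM)
    stds = np.exp(np.stack(log_stds))           # (N_MODELS, STATE_DIM)
    return means, stds

def penalized_reward(state, action, model_reward):
    """Penalize the reward by the largest predicted std (norm) in the ensemble,
    so rollouts are discouraged from visiting regions the model is unsure about."""
    _, stds = ensemble_predict(state, action)
    uncertainty = np.max(np.linalg.norm(stds, axis=-1))
    return model_reward - PENALTY_COEF * uncertainty

s, a = rng.normal(size=STATE_DIM), rng.normal(size=ACTION_DIM)
print(penalized_reward(s, a, model_reward=1.0))
```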
This list is automatically generated from the titles and abstracts of the papers on this site.