Compositional Foundation Models for Hierarchical Planning
- URL: http://arxiv.org/abs/2309.08587v2
- Date: Thu, 21 Sep 2023 14:49:20 GMT
- Title: Compositional Foundation Models for Hierarchical Planning
- Authors: Anurag Ajay, Seungwook Han, Yilun Du, Shuang Li, Abhi Gupta, Tommi
Jaakkola, Josh Tenenbaum, Leslie Kaelbling, Akash Srivastava, Pulkit Agrawal
- Abstract summary: We propose a foundation model which leverages multiple expert foundation models, trained individually on language, vision and action data, together to solve long-horizon tasks.
We use a large language model to construct symbolic plans that are grounded in the environment through a large video diffusion model.
Generated video plans are then grounded in visual-motor control through an inverse dynamics model that infers actions from the generated videos.
- Score: 52.18904315515153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To make effective decisions in novel environments with long-horizon goals, it
is crucial to engage in hierarchical reasoning across spatial and temporal
scales. This entails planning abstract subgoal sequences, visually reasoning
about the underlying plans, and executing actions in accordance with the
devised plan through visual-motor control. We propose Compositional Foundation
Models for Hierarchical Planning (HiP), a foundation model which leverages
multiple expert foundation models, trained individually on language, vision and
action data, jointly to solve long-horizon tasks. We use a large
language model to construct symbolic plans that are grounded in the environment
through a large video diffusion model. Generated video plans are then grounded
in visual-motor control through an inverse dynamics model that infers actions
from generated videos. To enable effective reasoning within this hierarchy, we
enforce consistency between the models via iterative refinement. We illustrate
the efficacy and adaptability of our approach in three different long-horizon
table-top manipulation tasks.
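The pipeline the abstract describes can be summarized in code. Below is a minimal sketch of the three-level composition (LLM subgoal proposal, video plan generation, inverse-dynamics action extraction) with iterative refinement rendered as a re-ranking step; every model interface here is a hypothetical stand-in, not the paper's actual models or APIs.

```python
# Minimal sketch of a HiP-style hierarchy. The three "foundation
# models" below are toy stubs, not the paper's actual models.

import numpy as np

def llm_propose_subgoals(task: str, n: int = 3) -> list[str]:
    """Stand-in for the LLM: decompose a task into symbolic subgoals."""
    return [f"{task} / subgoal {i}" for i in range(n)]

def llm_likelihood(subgoal: str) -> float:
    """Stand-in for the LLM's score of a candidate subgoal."""
    return -float(len(subgoal))  # toy log-likelihood

def video_model_score(subgoal: str, obs: np.ndarray) -> float:
    """Stand-in for the video model's feedback on whether a subgoal
    is visually achievable from the current observation."""
    return -abs(hash(subgoal) % 7 - obs.mean())  # toy score

def video_model_generate(subgoal: str, obs: np.ndarray) -> np.ndarray:
    """Stand-in for the video diffusion model: a (T, H, W) video plan."""
    rng = np.random.default_rng(abs(hash(subgoal)) % 2**32)
    return rng.random((8, 16, 16))

def inverse_dynamics(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Stand-in for the inverse dynamics model: action between frames."""
    return (frame_b - frame_a).mean(keepdims=True)

def hip_plan(task: str, obs: np.ndarray) -> list[np.ndarray]:
    actions = []
    for _ in range(3):  # one iteration per subgoal, for brevity
        # Iterative refinement: re-rank LLM proposals with video-model
        # feedback so the symbolic plan stays visually grounded.
        candidates = llm_propose_subgoals(task)
        subgoal = max(candidates,
                      key=lambda g: llm_likelihood(g) + video_model_score(g, obs))
        video = video_model_generate(subgoal, obs)
        # Ground the video plan in control: one action per frame pair.
        actions += [inverse_dynamics(a, b) for a, b in zip(video, video[1:])]
        obs = video[-1]  # pretend we executed the segment
    return actions

print(len(hip_plan("make a fruit salad", np.zeros((16, 16)))))
```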
Related papers
- Egocentric Vision Language Planning [44.436317004108105]
We explore leveraging large multi-modal models (LMMs) and text2image models to build a more general embodied agent.
The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective.
arXiv Detail & Related papers (2024-08-11T15:37:29Z)
- Adaptive Planning with Generative Models under Uncertainty [20.922248169620783]
Planning with generative models has emerged as an effective decision-making paradigm across a wide range of domains.
While continuous replanning at each timestep might seem intuitive because it allows decisions to be made based on the most recent environmental observations, it results in substantial computational challenges.
Our work addresses this challenge by introducing a simple adaptive planning policy that leverages the generative model's ability to predict long-horizon state trajectories.
arXiv Detail & Related papers (2024-08-02T18:07:53Z)
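A minimal sketch of the adaptive policy this summary describes: follow a cached long-horizon trajectory from the generative model and replan only when the observed state drifts past a threshold. The 1-D environment and the `generative_plan` stand-in are illustrative assumptions, not the paper's method.

```python
# Adaptive replanning in a toy 1-D environment. `generative_plan`
# stands in for a generative model predicting long-horizon states.

import numpy as np

def generative_plan(state: float, goal: float, horizon: int = 20):
    """Predict a state trajectory toward the goal (toy linear model)."""
    return np.linspace(state, goal, horizon + 1)[1:]

def step(state: float, target: float, rng) -> float:
    """Noisy environment: move toward the commanded target."""
    return target + rng.normal(scale=0.05)

rng = np.random.default_rng(0)
state, goal, threshold = 0.0, 10.0, 0.3
plan, t, replans = generative_plan(state, goal), 0, 0

for _ in range(40):
    state = step(state, plan[t], rng)
    # Replan only when reality drifts from the predicted trajectory;
    # otherwise keep executing the cached long-horizon plan.
    if abs(state - plan[t]) > threshold or t == len(plan) - 1:
        plan, t, replans = generative_plan(state, goal), 0, replans + 1
    else:
        t += 1
    if abs(state - goal) < 0.1:
        break

print(f"reached {state:.2f} with {replans} replans")
```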
- Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty [56.30846158280031]
Task planning for embodied AI has been one of the most challenging problems.
We propose a task-agnostic method named 'planning as in-painting'.
The proposed framework achieves promising performance in various embodied AI tasks.
arXiv Detail & Related papers (2023-12-02T10:07:17Z)
- FloorGenT: Generative Vector Graphic Model of Floor Plans for Robotics [5.71097144710995]
We show that by modelling floor plans as sequences of line segments seen from a particular point of view, recent advances in autoregressive sequence modelling can be leveraged to model and predict floor plans.
arXiv Detail & Related papers (2022-03-07T13:42:48Z)
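The representation this summary describes, floor plans as line-segment sequences fed to an autoregressive model, can be sketched as follows. The grid quantization, the toy floor plan, and the GRU sequence model are assumptions standing in for FloorGenT's actual tokenization and architecture.

```python
# Quantize each segment's endpoints to a small grid and train a
# next-token model on the flattened sequence. A tiny GRU stands in
# for the autoregressive sequence model.

import torch
import torch.nn as nn

GRID = 32  # coordinates quantized to a 32x32 grid -> vocab of 32

def serialize(segments):
    """Flatten [(x1, y1, x2, y2), ...] into one token sequence."""
    return torch.tensor([int(c * (GRID - 1)) for seg in segments for c in seg])

# A toy "floor plan": four walls of the unit square.
plan = [(0, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (0, 1, 0, 0)]
tokens = serialize(plan)

class SegModel(nn.Module):
    def __init__(self, vocab=GRID, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

model = SegModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Next-token objective: predict token t+1 from tokens <= t.
for _ in range(100):
    logits = model(tokens[None, :-1])
    loss = nn.functional.cross_entropy(logits[0], tokens[1:])
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final loss: {loss.item():.3f}")
```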
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
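A toy sketch of optimizing a manipulation-style trajectory against an objective defined on a signed-distance field: an analytic circle SDF stands in for the paper's learned functionals, and the costs and weights are illustrative assumptions.

```python
# Optimize waypoints by gradient descent so the path stays short
# while keeping a clearance margin from an SDF-represented obstacle.

import torch

def sdf_circle(p, center, radius):
    """Signed distance to a circular obstacle (negative = inside)."""
    return torch.linalg.norm(p - center, dim=-1) - radius

center = torch.tensor([0.5, 0.5])
start, goal = torch.tensor([0.0, 0.0]), torch.tensor([1.0, 1.0])

# Trajectory = 10 free waypoints between fixed start and goal.
way = torch.linspace(0, 1, 12)[1:-1, None] * (goal - start) + start
way = way.clone().requires_grad_(True)
opt = torch.optim.Adam([way], lr=0.02)

for _ in range(300):
    traj = torch.cat([start[None], way, goal[None]])
    smooth = ((traj[1:] - traj[:-1]) ** 2).sum()                    # short & smooth
    clear = torch.relu(0.05 - sdf_circle(way, center, 0.25)).sum()  # clearance margin
    loss = smooth + 10.0 * clear
    opt.zero_grad(); loss.backward(); opt.step()

print("min clearance:", sdf_circle(way, center, 0.25).min().item())
```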
- Predictive Control Using Learned State Space Models via Rolling Horizon Evolution [2.1016374925364616]
In this paper, we explore this theme by combining evolutionary algorithmic planning techniques with models learned via deep learning and variational inference.
We demonstrate the approach with an agent that reliably performs online planning in a set of visual navigation tasks.
arXiv Detail & Related papers (2021-06-25T23:23:42Z)
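A compact sketch of rolling-horizon evolutionary planning: evolve candidate action sequences, score them by rollouts through a model (here a known toy integrator standing in for the learned state-space model), execute only the first action, then shift the horizon and repeat.

```python
# Rolling-horizon evolution over action sequences in a toy 1-D task.

import numpy as np

rng = np.random.default_rng(0)

def model_rollout(state, actions):
    """Stand-in for the learned state-space model (1-D integrator)."""
    return state + np.cumsum(actions)

def fitness(state, actions, goal=5.0):
    # Prefer trajectories that reach the goal quickly and stay there.
    return -np.abs(model_rollout(state, actions) - goal).mean()

def evolve(state, horizon=10, pop=32, gens=15, elite=8):
    pop_a = rng.normal(size=(pop, horizon)) * 0.5
    for _ in range(gens):
        scores = np.array([fitness(state, a) for a in pop_a])
        parents = pop_a[np.argsort(scores)[-elite:]]
        # Children = mutated copies of elite parents.
        children = parents[rng.integers(elite, size=pop - elite)]
        pop_a = np.vstack([parents,
                           children + rng.normal(size=children.shape) * 0.1])
    return pop_a[np.argmax([fitness(state, a) for a in pop_a])]

state = 0.0
for t in range(12):          # rolling horizon: replan every step
    plan = evolve(state)
    state = state + plan[0]  # execute only the first action
print(f"final state: {state:.2f}")
```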
- A Consciousness-Inspired Planning Agent for Model-Based Reinforcement Learning [104.3643447579578]
We present an end-to-end, model-based deep reinforcement learning agent which dynamically attends to relevant parts of its state.
The design allows agents to learn to plan effectively, by attending to the relevant objects, leading to better out-of-distribution generalization.
arXiv Detail & Related papers (2021-06-03T19:35:19Z)
- Goal-Aware Prediction: Learning to Model What Matters [105.43098326577434]
One of the fundamental challenges in using a learned forward dynamics model is the mismatch between the objective of the learned model and that of the downstream planner or policy.
We propose to direct prediction towards task relevant information, enabling the model to be aware of the current task and encouraging it to only model relevant quantities of the state space.
We find that our method more effectively models the relevant parts of the scene conditioned on the goal, and as a result outperforms standard task-agnostic dynamics models and model-free reinforcement learning.
arXiv Detail & Related papers (2020-07-14T16:42:59Z)
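The idea of modeling only task-relevant quantities can be sketched by supervising a dynamics model on a goal-conditioned target (here, next distance-to-goal) instead of the full next state. The toy environment with distractor dimensions is an assumption for illustration, not the paper's setup.

```python
# Goal-conditioned, task-relevant prediction: the model is trained to
# predict the next distance-to-goal rather than reconstruct the full
# next state, so it never wastes capacity on distractor dimensions.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Full state = position (2 dims) plus 6 distractor dims the task ignores.
def env_step(state, action):
    nxt = state.clone()
    nxt[:, :2] += action                       # controllable part
    nxt[:, 2:] = torch.randn_like(nxt[:, 2:])  # irrelevant noise
    return nxt

net = nn.Sequential(nn.Linear(8 + 2 + 2, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for _ in range(500):
    s = torch.randn(128, 8)
    a = torch.randn(128, 2) * 0.1
    g = torch.randn(128, 2)
    # Supervise only what the planner needs: distance to goal after
    # taking the action.
    target = torch.linalg.norm(env_step(s, a)[:, :2] - g, dim=1, keepdim=True)
    pred = net(torch.cat([s, a, g], dim=1))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"prediction error: {loss.item():.4f}")
```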
- Latent Space Roadmap for Visual Action Planning of Deformable and Rigid Object Manipulation [74.88956115580388]
Planning is performed in a low-dimensional latent state space that embeds images.
Our framework consists of two main components: a Visual Foresight Module (VFM) that generates a visual plan as a sequence of images, and an Action Proposal Network (APN) that predicts the actions between them.
arXiv Detail & Related papers (2020-03-19T18:43:26Z)
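A schematic of the two-component design this summary describes: a roadmap built over latent embeddings yields a visual plan (the VFM role), and a second model proposes the action between consecutive plan states (the APN role). The latent data, connection radius, and both stand-in models are toy assumptions, not the paper's trained components.

```python
# Plan through a roadmap over latent embeddings, then propose actions
# between consecutive latents. All models here are toy stubs.

import numpy as np

rng = np.random.default_rng(0)

# "Dataset": latent embeddings of previously seen images (states).
latents = rng.random((50, 4))

def build_roadmap(z, radius=0.8):
    """Connect latent nodes closer than `radius` (the roadmap graph)."""
    d = np.linalg.norm(z[:, None] - z[None, :], axis=-1)
    return d < radius

def visual_plan(adj, start, goal):
    """BFS over the roadmap: a VFM-style visual plan as node indices."""
    prev, frontier = {start: None}, [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in np.flatnonzero(adj[u]):
                if v not in prev:
                    prev[v] = u
                    nxt.append(v)
        frontier = nxt
    path = [goal]
    while prev.get(path[-1]) is not None:
        path.append(prev[path[-1]])
    return path[::-1]

def action_proposal(z_a, z_b):
    """APN stand-in: the action that moves between consecutive latents."""
    return z_b - z_a

adj = build_roadmap(latents)
path = visual_plan(adj, start=0, goal=7)
actions = [action_proposal(latents[a], latents[b])
           for a, b in zip(path, path[1:])]
print(f"plan of {len(path)} images, {len(actions)} actions")
```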