Energy-based Models are Zero-Shot Planners for Compositional Scene
Rearrangement
- URL: http://arxiv.org/abs/2304.14391v4
- Date: Tue, 23 Jan 2024 15:52:28 GMT
- Title: Energy-based Models are Zero-Shot Planners for Compositional Scene
Rearrangement
- Authors: Nikolaos Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, Christopher
Atkeson, Katerina Fragkiadaki
- Abstract summary: We show that our framework can execute compositional instructions zero-shot in simulation and in the real world.
It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts.
- Score: 19.494104738436892
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language is compositional; an instruction can express multiple relation
constraints to hold among objects in a scene that a robot is tasked to
rearrange. Our focus in this work is an instructable scene-rearranging
framework that generalizes to longer instructions and to spatial concept
compositions never seen at training time. We propose to represent
language-instructed spatial concepts with energy functions over relative object
arrangements. A language parser maps instructions to corresponding energy
functions and an open-vocabulary visual-language model grounds their arguments
to relevant objects in the scene. We generate goal scene configurations by
gradient descent on the sum of energy functions, one per language predicate in
the instruction. Local vision-based policies then re-locate objects to the
inferred goal locations. We test our model on established instruction-guided
manipulation benchmarks, as well as benchmarks of compositional instructions we
introduce. We show our model can execute highly compositional instructions
zero-shot in simulation and in the real world. It outperforms
language-to-action reactive policies and Large Language Model planners by a
large margin, especially for long instructions that involve compositions of
multiple spatial concepts. Simulation and real-world robot execution videos, as
well as our code and datasets are publicly available on our website:
https://ebmplanner.github.io.
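The core of the approach, writing one differentiable energy term per language predicate and running gradient descent on their sum to obtain goal object locations, can be illustrated with a short, self-contained sketch. Everything below is assumed for the example: the two predicate energies (energy_left_of, energy_near), their margins, and the object names stand in for what the paper's parser and visual-language grounder would produce; the authors' actual implementation is at https://ebmplanner.github.io.
```python
# Minimal sketch (not the authors' code): one energy term per language predicate;
# gradient descent on the sum yields a goal arrangement of 2D object positions.
import torch

def energy_left_of(a, b, margin=0.10):
    # Low (zero) energy once object a sits at least `margin` to the left of b along x.
    return torch.relu(a[0] - b[0] + margin)

def energy_near(a, b, dist=0.15):
    # Low energy when a and b are roughly `dist` apart.
    return (torch.norm(a - b) - dist).abs()

# Hypothetical objects grounded in the scene (e.g., by an open-vocabulary detector),
# initialized at their current 2D positions.
positions = {
    "mug":   torch.tensor([0.40, 0.10], requires_grad=True),
    "plate": torch.tensor([0.00, 0.00], requires_grad=True),
    "fork":  torch.tensor([-0.30, 0.25], requires_grad=True),
}

# "Put the mug left of the plate and the fork near the mug", parsed into two predicates.
def total_energy(p):
    return energy_left_of(p["mug"], p["plate"]) + energy_near(p["fork"], p["mug"])

opt = torch.optim.Adam(list(positions.values()), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    total_energy(positions).backward()
    opt.step()

goal = {name: pos.detach().tolist() for name, pos in positions.items()}
print(goal)  # inferred goal locations for a downstream pick-and-place policy
```
In the full system, the resulting goal locations would be handed to local vision-based pick-and-place policies; objects that should not move could simply be excluded from the optimizer.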
Related papers
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
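As a rough, hand-rolled illustration of what augmenting an instruction with position cues might look like (this is not the paper's MLLM-based pipeline; the detections and the template are invented), one can append detected object locations to the instruction string:
```python
# Toy illustration of position-cue augmentation; detections are assumed to come
# from some perception module, and the MLLM step is replaced by plain templating.
detections = {"red mug": (0.42, 0.18), "blue plate": (0.10, -0.05)}  # hypothetical (x, y) in meters

def augment(instruction: str, detections: dict) -> str:
    cues = "; ".join(f"{name} at ({x:.2f}, {y:.2f})" for name, (x, y) in detections.items())
    return f"{instruction} [scene: {cues}]"

print(augment("put the red mug on the blue plate", detections))
# put the red mug on the blue plate [scene: red mug at (0.42, 0.18); blue plate at (0.10, -0.05)]
```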
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
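The value-map idea can be pictured with a toy 2D example. This is only an illustration of composing maps, not VoxPoser's implementation: the grid resolution, the Gaussian shapes, the object locations, and the weighting are all invented.
```python
# Toy composition of value maps on a coarse 2D slice of the workspace:
# reward proximity to a target, penalize proximity to an obstacle, read off a waypoint.
import numpy as np

N = 50
xs, ys = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N), indexing="ij")

def gaussian(cx, cy, sigma):
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

target_map   = gaussian(0.8, 0.5, sigma=0.10)   # high value near a hypothetical target
obstacle_map = gaussian(0.5, 0.5, sigma=0.08)   # high value near an obstacle to avoid

value = target_map - 2.0 * obstacle_map          # simple composition of the two maps
i, j = np.unravel_index(np.argmax(value), value.shape)
print((xs[i, j], ys[i, j]))  # a motion planner would steer the end-effector toward such points
```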
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment, where the agent's goal is to find a particular office and office locations differ across buildings (i.e., tasks).
We find RL agents indeed are able to indirectly learn language. Agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
- Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following [101.55727845195969]
We propose the Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks.
ECL is fully transparent and step-by-step interpretable in long-term planning.
arXiv Detail & Related papers (2023-04-07T17:59:34Z)
- Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach [12.016988248578027]
We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial constraints in language instructions.
We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample-efficient and generalizable.
arXiv Detail & Related papers (2023-04-06T06:51:15Z)
- Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement [26.74189486483276]
We introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement.
It parses language instructions into relations between objects and grounds those objects in visual scenes.
ParaGon encodes all of those procedures into neural networks for end-to-end training.
arXiv Detail & Related papers (2022-10-01T07:36:51Z)
- VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation [11.92150014766458]
We aim to address the last mile of embodied agents: object manipulation by following human guidance.
We build a Vision-and-Language Manipulation benchmark (VLMbench) based on it, containing various language instructions on categorized robotic manipulation tasks.
Modular rule-based task templates are created to automatically generate robot demonstrations with language instructions.
arXiv Detail & Related papers (2022-06-17T03:07:18Z)
- Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.