Energy-based Models are Zero-Shot Planners for Compositional Scene
Rearrangement
- URL: http://arxiv.org/abs/2304.14391v4
- Date: Tue, 23 Jan 2024 15:52:28 GMT
- Title: Energy-based Models are Zero-Shot Planners for Compositional Scene
Rearrangement
- Authors: Nikolaos Gkanatsios, Ayush Jain, Zhou Xian, Yunchu Zhang, Christopher
Atkeson, Katerina Fragkiadaki
- Abstract summary: We show that our framework can execute compositional instructions zero-shot in simulation and in the real world.
It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts.
- Score: 19.494104738436892
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Language is compositional; an instruction can express multiple relation
constraints to hold among objects in a scene that a robot is tasked to
rearrange. Our focus in this work is an instructable scene-rearranging
framework that generalizes to longer instructions and to spatial concept
compositions never seen at training time. We propose to represent
language-instructed spatial concepts with energy functions over relative object
arrangements. A language parser maps instructions to corresponding energy
functions and an open-vocabulary visual-language model grounds their arguments
to relevant objects in the scene. We generate goal scene configurations by
gradient descent on the sum of energy functions, one per language predicate in
the instruction. Local vision-based policies then re-locate objects to the
inferred goal locations. We test our model on established instruction-guided
manipulation benchmarks, as well as benchmarks of compositional instructions we
introduce. We show our model can execute highly compositional instructions
zero-shot in simulation and in the real world. It outperforms
language-to-action reactive policies and Large Language Model planners by a
large margin, especially for long instructions that involve compositions of
multiple spatial concepts. Simulation and real-world robot execution videos, as
well as our code and datasets are publicly available on our website:
https://ebmplanner.github.io.
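The core of the approach, writing one differentiable energy term per language predicate and running gradient descent on their sum to obtain goal object locations, can be illustrated with a short, self-contained sketch. Everything below is assumed for the example: the two predicate energies (energy_left_of, energy_near), their margins, and the object names stand in for what the paper's parser and visual-language grounder would produce; the authors' actual implementation is at https://ebmplanner.github.io.
```python
# Minimal sketch (not the authors' code): one energy term per language predicate;
# gradient descent on the sum yields a goal arrangement of 2D object positions.
import torch

def energy_left_of(a, b, margin=0.10):
    # Low (zero) energy once object a sits at least `margin` to the left of b along x.
    return torch.relu(a[0] - b[0] + margin)

def energy_near(a, b, dist=0.15):
    # Low energy when a and b are roughly `dist` apart.
    return (torch.norm(a - b) - dist).abs()

# Hypothetical objects grounded in the scene (e.g., by an open-vocabulary detector),
# initialized at their current 2D positions.
positions = {
    "mug":   torch.tensor([0.40, 0.10], requires_grad=True),
    "plate": torch.tensor([0.00, 0.00], requires_grad=True),
    "fork":  torch.tensor([-0.30, 0.25], requires_grad=True),
}

# "Put the mug left of the plate and the fork near the mug", parsed into two predicates.
def total_energy(p):
    return energy_left_of(p["mug"], p["plate"]) + energy_near(p["fork"], p["mug"])

opt = torch.optim.Adam(list(positions.values()), lr=0.05)
for _ in range(200):
    opt.zero_grad()
    total_energy(positions).backward()
    opt.step()

goal = {name: pos.detach().tolist() for name, pos in positions.items()}
print(goal)  # inferred goal locations for a downstream pick-and-place policy
```
In the full system, the resulting goal locations would be handed to local vision-based pick-and-place policies; objects that should not move could simply be excluded from the optimizer.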
Related papers
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instruction.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
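As a rough, hand-rolled illustration of what augmenting an instruction with position cues might look like (this is not the paper's MLLM-based pipeline; the detections and the template are invented), one can append detected object locations to the instruction string:
```python
# Toy illustration of position-cue augmentation; detections are assumed to come
# from some perception module, and the MLLM step is replaced by plain templating.
detections = {"red mug": (0.42, 0.18), "blue plate": (0.10, -0.05)}  # hypothetical (x, y) in meters

def augment(instruction: str, detections: dict) -> str:
    cues = "; ".join(f"{name} at ({x:.2f}, {y:.2f})" for name, (x, y) in detections.items())
    return f"{instruction} [scene: {cues}]"

print(augment("put the red mug on the blue plate", detections))
# put the red mug on the blue plate [scene: red mug at (0.42, 0.18); blue plate at (0.10, -0.05)]
```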
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
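The value-map idea can be pictured with a toy 2D example. This is only an illustration of composing maps, not VoxPoser's implementation: the grid resolution, the Gaussian shapes, the object locations, and the weighting are all invented.
```python
# Toy composition of value maps on a coarse 2D slice of the workspace:
# reward proximity to a target, penalize proximity to an obstacle, read off a waypoint.
import numpy as np

N = 50
xs, ys = np.meshgrid(np.linspace(0, 1, N), np.linspace(0, 1, N), indexing="ij")

def gaussian(cx, cy, sigma):
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

target_map   = gaussian(0.8, 0.5, sigma=0.10)   # high value near a hypothetical target
obstacle_map = gaussian(0.5, 0.5, sigma=0.08)   # high value near an obstacle to avoid

value = target_map - 2.0 * obstacle_map          # simple composition of the two maps
i, j = np.unravel_index(np.argmax(value), value.shape)
print((xs[i, j], ys[i, j]))  # a motion planner would steer the end-effector toward such points
```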
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Simple Embodied Language Learning as a Byproduct of Meta-Reinforcement Learning [56.07190845063208]
We ask: can embodied reinforcement learning (RL) agents indirectly learn language from non-language tasks?
We design an office navigation environment, where the agent's goal is to find a particular office and office locations differ across buildings (i.e., tasks).
We find RL agents indeed are able to indirectly learn language. Agents trained with current meta-RL algorithms successfully generalize to reading floor plans with held-out layouts and language phrases.
arXiv Detail & Related papers (2023-06-14T09:48:48Z)
- Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z)
- Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following [101.55727845195969]
We propose the Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks.
ECL is fully transparent and step-by-step interpretable in long-term planning.
arXiv Detail & Related papers (2023-04-07T17:59:34Z)
- Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach [12.016988248578027]
We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial constraints in language instructions.
We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample-efficient and generalizable.
arXiv Detail & Related papers (2023-04-06T06:51:15Z)
- Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement [26.74189486483276]
We introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement.
It parses language instructions into relations between objects and grounds those objects in visual scenes.
ParaGon encodes all of those procedures into neural networks for end-to-end training.
arXiv Detail & Related papers (2022-10-01T07:36:51Z)
- VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation [11.92150014766458]
We aim to address the last mile of embodied agents: object manipulation by following human guidance.
We build a Vision-and-Language Manipulation benchmark (VLMbench) based on it, containing various language instructions on categorized robotic manipulation tasks.
Modular rule-based task templates are created to automatically generate robot demonstrations with language instructions.
arXiv Detail & Related papers (2022-06-17T03:07:18Z)
- Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z)
- LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.