Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
- URL: http://arxiv.org/abs/2410.01345v1
- Date: Wed, 2 Oct 2024 09:02:34 GMT
- Title: Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
- Authors: Ricardo Garcia, Shizhe Chen, Cordelia Schmid,
- Abstract summary: GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies.
We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs.
3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
- Score: 68.50785963043161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generalizing language-conditioned robotic policies to new tasks remains a significant challenge, hampered by the lack of suitable simulation benchmarks. In this paper, we address this gap by introducing GemBench, a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. GemBench incorporates seven general action primitives and four levels of generalization, spanning novel placements, rigid and articulated objects, and complex long-horizon tasks. We evaluate state-of-the-art approaches on GemBench and also introduce a new method. Our approach 3D-LOTUS leverages rich 3D information for action prediction conditioned on language. While 3D-LOTUS excels in both efficiency and performance on seen tasks, it struggles with novel tasks. To address this, we present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs and the object grounding accuracy of VLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation. The benchmark, codes and trained models are available at \url{https://www.di.ens.fr/willow/research/gembench/}.
Related papers
- PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model [4.079327215055764]
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world.
Visual Language Models (VLMs) have excelled in high-level reasoning but fall short in grasping the nuanced physical properties required for effective human-robot interaction.
We introduce PAVLM, an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point cloud.
arXiv Detail & Related papers (2024-10-15T12:53:42Z) - M3Bench: Benchmarking Whole-body Motion Generation for Mobile Manipulation in 3D Scenes [66.44171200767839]
We propose M3Bench, a new benchmark of whole-body motion generation for mobile manipulation tasks.
M3Bench requires an embodied agent to understand its configuration, environmental constraints and task objectives.
M3Bench features 30k object rearrangement tasks across 119 diverse scenes, providing expert demonstrations generated by our newly developed M3BenchMaker.
arXiv Detail & Related papers (2024-10-09T08:38:21Z) - LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - An Embodied Generalist Agent in 3D World [67.16935110789528]
We introduce LEO, an embodied multi-modal generalist agent that excels in perceiving, grounding, reasoning, planning, and acting in the 3D world.
We collect large-scale datasets comprising diverse object-level and scene-level tasks, which require considerable understanding of and interaction with the 3D world.
Through extensive experiments, we demonstrate LEO's remarkable proficiency across a wide spectrum of tasks, including 3D captioning, question answering, embodied reasoning, navigation and manipulation.
arXiv Detail & Related papers (2023-11-18T01:21:38Z) - Generalizable Long-Horizon Manipulations with Large Language Models [91.740084601715]
This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations.
We create a challenging robotic manipulation task suite based on Pybullet for long-horizon task evaluation.
arXiv Detail & Related papers (2023-10-03T17:59:46Z) - ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous
States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.