Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
- URL: http://arxiv.org/abs/2506.11261v1
- Date: Thu, 12 Jun 2025 20:04:31 GMT
- Title: Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
- Authors: Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid
- Abstract summary: We introduce Gondola, a grounded vision-language planning model based on large language models (LLMs) for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. Gondola outperforms the state-of-the-art LLM-based method across all four levels of the GemBench dataset.
- Score: 62.711546725154314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.
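The abstract above describes Gondola's interface at a high level: multi-view images, a language instruction, and the plan history go in, and the next plan step comes out as interleaved text and segmentation masks. Below is a minimal Python sketch of that input/output structure, offered for illustration only; the class and function names (PlannerInput, PlanStep, next_plan_step) and the mask layout are assumptions and do not come from the paper or any released code.

```python
# Hypothetical sketch of the planning interface described in the abstract.
# Names and data layouts are assumptions; the actual Gondola code may differ.
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class PlanStep:
    """One step of a grounded plan: an action phrase plus optional masks."""
    text: str                                   # e.g. "pick up the red mug"
    object_mask: Optional[np.ndarray] = None    # HxW boolean mask of the target object
    location_mask: Optional[np.ndarray] = None  # HxW boolean mask of the target location


@dataclass
class PlannerInput:
    instruction: str                     # natural-language task instruction
    multi_view_images: List[np.ndarray]  # one HxWx3 RGB image per camera view
    history: List[PlanStep] = field(default_factory=list)  # previously produced steps


def next_plan_step(inp: PlannerInput) -> PlanStep:
    """Stub standing in for the LLM-based planner: given the multi-view
    observations, the instruction, and the plan history, return the next
    grounded plan step. Model inference is deliberately not shown."""
    raise NotImplementedError("vision-language model inference is not part of this sketch")
```

The only point of the sketch is the structure the abstract emphasizes: each plan step pairs an action phrase with segmentation masks for the target object and placement location, which is what separates a grounded plan from a text-only LLM plan.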
Related papers
- Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy [68.50785963043161]
GemBench is a novel benchmark to assess generalization capabilities of vision-language robotic manipulation policies. We present 3D-LOTUS++, a framework that integrates 3D-LOTUS's motion planning capabilities with the task planning capabilities of LLMs. 3D-LOTUS++ achieves state-of-the-art performance on novel tasks of GemBench, setting a new standard for generalization in robotic manipulation.
arXiv Detail & Related papers (2024-10-02T09:02:34Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z)
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks under physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than those of LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- ProgPrompt: Generating Situated Robot Task Plans using Large Language Models [68.57918965060787]
Large language models (LLMs) can be used to score potential next actions during task planning.
We present a programmatic LLM prompt structure that enables plan generation to function across situated environments.
arXiv Detail & Related papers (2022-09-22T20:29:49Z)
- A Long Horizon Planning Framework for Manipulating Rigid Pointcloud Objects [25.428781562909606]
We present a framework for solving long-horizon planning problems involving manipulation of rigid objects.
Our method plans in the space of object subgoals and frees the planner from reasoning about robot-object interaction dynamics.
arXiv Detail & Related papers (2020-11-16T18:59:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.