Related papers: Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

URL: http://arxiv.org/abs/2311.17842v2
Date: Sun, 24 Dec 2023 03:48:40 GMT
Title: Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning
Authors: Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, Yang Gao
Abstract summary: We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning. ViLa directly integrates perceptual data into its reasoning and planning process. Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
Score: 32.045840007623276
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and dependence on external affordance models to perceive environmental information, which cannot jointly reason with LLMs. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.

Related papers

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation [62.711546725154314]
We introduce Gondola, a grounded vision-language planning model based on large language models (LLMs) for generalizable robotic manipulation.<n>G Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations.<n>G Gondola outperforms the state-of-the-art LLM-based method across all four levels of the GemBench dataset.
arXiv Detail & Related papers (2025-06-12T20:04:31Z)
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model [4.557035895252272]
Vision Language Models (VLMs) have been adopted in robotics for their capability in common sense reasoning and generalizability. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. We named it SeeDo because it enables the VLM to ''see'' human demonstrations and explain the corresponding plans to the robot for it to ''do''
arXiv Detail & Related papers (2024-10-11T13:17:52Z)
Solving Robotics Problems in Zero-Shot with Vision-Language Models [0.0]
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed to solve robotics problems in a zero-shot regime. In our context, zero-shot means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description. Our system showcases the ability to handle diverse tasks such as manipulation, goal-reaching, and visual reasoning -- all in a zero-shot manner.
arXiv Detail & Related papers (2024-07-26T21:18:57Z)
VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs) One understudied capability inVLMs is visual spatial planning. Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets. We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual with low-level physical-geometric reasoning. We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models. We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions. Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world. We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
Large Language Models for Robotics: Opportunities, Challenges, and Perspectives [46.57277568357048]
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. For embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception. We propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.
arXiv Detail & Related papers (2024-01-09T03:22:16Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios. We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning. We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) and existing visual grounding and robotic grasping system. We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration. We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning. Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.