Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning
- URL: http://arxiv.org/abs/2311.17842v2
- Date: Sun, 24 Dec 2023 03:48:40 GMT
- Title: Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning
- Authors: Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, Yang Gao
- Abstract summary: We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
- Score: 32.045840007623276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we are interested in imbuing robots with the capability of
physically-grounded task planning. Recent advancements have shown that large
language models (LLMs) possess extensive knowledge useful in robotic tasks,
especially in reasoning and planning. However, LLMs are constrained by their
lack of world grounding and dependence on external affordance models to
perceive environmental information, which cannot jointly reason with LLMs. We
argue that a task planner should be an inherently grounded, unified multimodal
system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a
novel approach for long-horizon robotic planning that leverages vision-language
models (VLMs) to generate a sequence of actionable steps. ViLa directly
integrates perceptual data into its reasoning and planning process, enabling a
profound understanding of commonsense knowledge in the visual world, including
spatial layouts and object attributes. It also supports flexible multimodal
goal specification and naturally incorporates visual feedback. Our extensive
evaluation, conducted in both real-robot and simulated environments,
demonstrates ViLa's superiority over existing LLM-based planners, highlighting
its effectiveness in a wide array of open-world manipulation tasks.
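The closed-loop idea in the abstract (plan from the current image, execute one step, look again) can be illustrated with a minimal sketch. This is not the authors' implementation: every name below (`Observation`, `closed_loop_plan`, `toy_planner`, `run_demo`) is hypothetical, the "image" is reduced to a list of visible object names, and a rule-based stub stands in for the vision-language model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    """Stand-in for a camera image: just the names of visible objects."""
    visible_objects: List[str]


# Hypothetical planner signature: (goal, observation, history) -> next step.
# In a real system this would be a call to a vision-language model.
Planner = Callable[[str, Observation, List[str]], str]


def closed_loop_plan(
    goal: str,
    observe: Callable[[], Observation],
    planner: Planner,
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> List[str]:
    """Re-observe the scene before every step, so each decision uses visual feedback."""
    history: List[str] = []
    for _ in range(max_steps):
        obs = observe()
        step = planner(goal, obs, history)
        if step == "done":
            break
        execute(step)
        history.append(step)
    return history


def toy_planner(goal: str, obs: Observation, history: List[str]) -> str:
    """Rule-based stub (assumption): clear the blocking cup before grasping the apple."""
    if "cup" in obs.visible_objects:
        return "move cup aside"
    if "apple" in obs.visible_objects:
        return "pick up apple"
    return "done"


def run_demo() -> List[str]:
    """Toy scene in which a cup blocks the apple."""
    scene = ["cup", "apple"]

    def observe() -> Observation:
        return Observation(visible_objects=list(scene))

    def execute(step: str) -> None:
        # Toy dynamics: each action removes the object it acts on.
        if step == "move cup aside":
            scene.remove("cup")
        elif step == "pick up apple":
            scene.remove("apple")

    return closed_loop_plan("pick up the apple", observe, toy_planner, execute)


if __name__ == "__main__":
    print(run_demo())
```

Because the planner is queried again after every executed step, a change in the scene is reflected in the next decision. An open-loop planner would instead commit to the whole sequence up front.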
Related papers
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that evaluates the spatial planning capability in these models in general.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains.
We propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- Exploring Unseen Environments with Robots using Large Language and Vision Models through a Procedurally Generated 3D Scene Representation [0.979851640406258]
This work focuses on solving the object goal navigation problem by mimicking human cognition.
We introduce a comprehensive framework capable of exploring an unfamiliar environment in search of an object.
A key challenge in using LLMs to generate high-level sub-goals is representing the environment around the robot efficiently.
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
- Large Language Models for Robotics: Opportunities, Challenges, and Perspectives [46.57277568357048]
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains.
For embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception.
We propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.
arXiv Detail & Related papers (2024-01-09T03:22:16Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with existing visual grounding and robotic grasping systems.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.