Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning
- URL: http://arxiv.org/abs/2311.17842v2
- Date: Sun, 24 Dec 2023 03:48:40 GMT
- Title: Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning
- Authors: Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, Yang Gao
- Abstract summary: We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
- Score: 32.045840007623276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we are interested in imbuing robots with the capability of
physically-grounded task planning. Recent advancements have shown that large
language models (LLMs) possess extensive knowledge useful in robotic tasks,
especially in reasoning and planning. However, LLMs are constrained by their
lack of world grounding and dependence on external affordance models to
perceive environmental information, which cannot jointly reason with LLMs. We
argue that a task planner should be an inherently grounded, unified multimodal
system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a
novel approach for long-horizon robotic planning that leverages vision-language
models (VLMs) to generate a sequence of actionable steps. ViLa directly
integrates perceptual data into its reasoning and planning process, enabling a
profound understanding of commonsense knowledge in the visual world, including
spatial layouts and object attributes. It also supports flexible multimodal
goal specification and naturally incorporates visual feedback. Our extensive
evaluation, conducted in both real-robot and simulated environments,
demonstrates ViLa's superiority over existing LLM-based planners, highlighting
its effectiveness in a wide array of open-world manipulation tasks.
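To make the planning loop described in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes the official `openai` Python SDK, uses `gpt-4o` as a stand-in for GPT-4V, and takes hypothetical `capture_image` and `execute_skill` callables for the robot side.

```python
import base64
from openai import OpenAI  # assumes the official openai>=1.0 Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(image_bytes: bytes) -> str:
    """Base64-encode a camera frame so it can be sent to the VLM."""
    return base64.b64encode(image_bytes).decode("utf-8")


def propose_next_step(goal: str, image_bytes: bytes, history: list[str]) -> str:
    """Ask the VLM for the next actionable step, grounded in the current image."""
    prompt = (
        f"Goal: {goal}\n"
        f"Steps already executed: {history or 'none'}\n"
        "Given the attached image of the scene, output the single next "
        "high-level step, or DONE if the goal is already satisfied."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name; the paper uses GPT-4V
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_bytes)}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()


def closed_loop_plan(goal: str, capture_image, execute_skill, max_steps: int = 20):
    """Closed-loop planning: re-observe the scene after every executed step."""
    history: list[str] = []
    for _ in range(max_steps):
        step = propose_next_step(goal, capture_image(), history)
        if step.upper().startswith("DONE"):
            break
        execute_skill(step)  # hand the high-level step to low-level skills/policies
        history.append(step)
    return history
```

Re-capturing an image on every iteration is what gives the planner visual feedback; the high-level steps it emits still rely on separate low-level skills or policies for execution.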
Related papers
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model [4.557035895252272]
Vision Language Models (VLMs) have been adopted in robotics for their capability in common sense reasoning and generalizability.
In this work, we explore using VLMs to interpret human demonstration videos and generate robot task plans.
We name it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do".
arXiv Detail & Related papers (2024-10-11T13:17:52Z)
- Solving Robotics Problems in Zero-Shot with Vision-Language Models [0.0]
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed to solve robotics problems in a zero-shot regime.
In our context, zero-shot means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description.
Our system showcases the ability to handle diverse tasks such as manipulation, goal-reaching, and visual reasoning -- all in a zero-shot manner.
arXiv Detail & Related papers (2024-07-26T21:18:57Z)
- VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world (a minimal illustrative sketch of this idea appears after this list).
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- Large Language Models for Robotics: Opportunities, Challenges, and Perspectives [46.57277568357048]
Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains.
For embodied tasks, where robots interact with complex environments, text-only LLMs often face challenges due to a lack of compatibility with robotic visual perception.
We propose a framework that utilizes multimodal GPT-4V to enhance embodied task planning through the combination of natural language instructions and robot visual perceptions.
arXiv Detail & Related papers (2024-01-09T03:22:16Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning [84.6451394629312]
We introduce EgoPlan-Bench, a benchmark to evaluate the planning abilities of MLLMs in real-world scenarios.
We show that EgoPlan-Bench poses significant challenges, highlighting a substantial scope for improvement in MLLMs to achieve human-level task planning.
We also present EgoPlan-IT, a specialized instruction-tuning dataset that effectively enhances model performance on EgoPlan-Bench.
arXiv Detail & Related papers (2023-12-11T03:35:58Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Chat with the Environment: Interactive Multimodal Perception Using Large Language Models [19.623070762485494]
Large Language Models (LLMs) have shown remarkable reasoning ability in few-shot robotic planning.
Our study demonstrates that LLMs can provide high-level planning and reasoning skills and control interactive robot behavior in a multimodal environment.
arXiv Detail & Related papers (2023-03-14T23:01:27Z)
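As a closing illustration of the point-based affordance idea from the MOKA entry above, here is a minimal, hypothetical sketch (not MOKA's code; the class, function names, and the pinhole back-projection are illustrative assumptions): VLM-marked pixel keypoints are lifted to 3D waypoints using a depth image and camera intrinsics, which is the bridge between image-space predictions and physical robot actions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class KeypointAffordance:
    """A VLM-predicted affordance expressed as image keypoints (pixels)."""
    grasp_px: tuple[int, int]   # where to grasp, in (u, v) image coordinates
    target_px: tuple[int, int]  # where to move/place, in (u, v) image coordinates


def pixel_to_point(px: tuple[int, int], depth_m: float, K: np.ndarray) -> np.ndarray:
    """Back-project a pixel to a 3D point in the camera frame using intrinsics K."""
    u, v = px
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])


def affordance_to_waypoints(aff: KeypointAffordance,
                            depth_image: np.ndarray,
                            K: np.ndarray) -> dict[str, np.ndarray]:
    """Turn 2D keypoints into 3D waypoints for a downstream motion planner."""
    grasp_depth = float(depth_image[aff.grasp_px[1], aff.grasp_px[0]])
    target_depth = float(depth_image[aff.target_px[1], aff.target_px[0]])
    return {
        "grasp": pixel_to_point(aff.grasp_px, grasp_depth, K),
        "place": pixel_to_point(aff.target_px, target_depth, K),
    }
```

The design point is that the VLM only reasons over marked image points, while everything metric (depth, intrinsics, motion planning) stays on the robot side.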