PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
- URL: http://arxiv.org/abs/2402.07872v1
- Date: Mon, 12 Feb 2024 18:33:47 GMT
- Title: PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs
- Authors: Soroush Nasiriany, Fei Xia, Wenhao Yu, Ted Xiao, Jacky Liang, Ishita
Dasgupta, Annie Xie, Danny Driess, Ayzaan Wahid, Zhuo Xu, Quan Vuong, Tingnan
Zhang, Tsang-Wei Edward Lee, Kuang-Huei Lee, Peng Xu, Sean Kirmani, Yuke Zhu,
Andy Zeng, Karol Hausman, Nicolas Heess, Chelsea Finn, Sergey Levine, Brian
Ichter
- Abstract summary: Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT)
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
- Score: 140.14239499047977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language models (VLMs) have shown impressive capabilities across a
variety of tasks, from logical reasoning to visual understanding. This opens
the door to richer interaction with the world, for example robotic control.
However, VLMs produce only textual outputs, while robotic control and other
spatial tasks require outputting continuous coordinates, actions, or
trajectories. How can we enable VLMs to handle such settings without
fine-tuning on task-specific data?
In this paper, we propose a novel visual prompting approach for VLMs that we
call Prompting with Iterative Visual Optimization (PIVOT), which casts tasks as
iterative visual question answering. In each iteration, the image is annotated
with a visual representation of proposals that the VLM can refer to (e.g.,
candidate robot actions, localizations, or trajectories). The VLM then selects
the best ones for the task. These proposals are iteratively refined, allowing
the VLM to eventually zero in on the best available answer. We investigate
PIVOT on real-world robotic navigation, real-world manipulation from images,
instruction following in simulation, and additional spatial inference tasks
such as localization. We find, perhaps surprisingly, that our approach enables
zero-shot control of robotic systems without any robot training data,
navigation in a variety of environments, and other capabilities. Although
current performance is far from perfect, our work highlights potentials and
limitations of this new regime and shows a promising approach for
Internet-Scale VLMs in robotic and spatial reasoning domains. Website:
pivot-prompt.github.io and HuggingFace:
https://huggingface.co/spaces/pivot-prompt/pivot-prompt-demo.
Related papers
- Visual Language Models as Operator Agents in the Space Domain [36.943670587532026]
Vision-Language Models (VLMs) can enhance autonomous control and decision-making in space missions.
In the software context, we employ VLMs to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers.
In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites.
arXiv Detail & Related papers (2025-01-14T03:03:37Z) - VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model [4.557035895252272]
Vision Language Models (VLMs) have been adopted in robotics for their capability in common sense reasoning and generalizability.
In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning.
We named it SeeDo because it enables the VLM to ''see'' human demonstrations and explain the corresponding plans to the robot for it to ''do''
arXiv Detail & Related papers (2024-10-11T13:17:52Z) - LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments yield strong performance, demonstrating that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Look Before You Leap: Unveiling the Power of GPT-4V in Robotic
Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.