MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting
- URL: http://arxiv.org/abs/2403.03174v1
- Date: Tue, 5 Mar 2024 18:08:45 GMT
- Title: MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting
- Authors: Fangchen Liu, Kuan Fang, Pieter Abbeel, Sergey Levine
- Abstract summary: We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
- Score: 106.53784213239479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary generalization requires robotic systems to perform tasks
involving complex and diverse environments and task goals. While the recent
advances in vision language models (VLMs) present unprecedented opportunities
to solve unseen problems, how to utilize their emergent capabilities to control
robots in the physical world remains an open question. In this paper, we
present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that
employs VLMs to solve robotic manipulation tasks specified by free-form
language descriptions. At the heart of our approach is a compact point-based
representation of affordance and motion that bridges the VLM's predictions on
RGB images and the robot's motions in the physical world. By prompting a VLM
pre-trained on Internet-scale data, our approach predicts the affordances and
generates the corresponding motions by leveraging the concept understanding and
commonsense knowledge from broad sources. To scaffold the VLM's reasoning in
zero-shot, we propose a visual prompting technique that annotates marks on the
images, converting the prediction of keypoints and waypoints into a series of
visual question answering problems that are feasible for the VLM to solve.
Using the robot experiences collected in this way, we further investigate ways
to bootstrap the performance through in-context learning and policy
distillation. We evaluate and analyze MOKA's performance on a variety of
manipulation tasks specified by free-form language descriptions, such as tool
use, deformable body manipulation, and object rearrangement.
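The abstract describes converting keypoint prediction into visual question answering by annotating marks on the image and asking the VLM to choose among them. The sketch below illustrates that idea in minimal form; the candidate-point generator, prompt wording, and label-decoding helper are hypothetical stand-ins, not MOKA's actual implementation (which proposes candidates from segmentations and queries a pretrained VLM on the annotated image).

```python
# Illustrative sketch of mark-based visual prompting: annotate candidate
# keypoints as labeled marks, phrase selection as a VQA problem, and map
# the VLM's answer back to a pixel coordinate. All helpers are hypothetical.

def propose_candidate_points(width, height, n=5):
    """Place n evenly spaced candidate keypoints along the image width.
    (MOKA derives candidates from object segmentations; a uniform row
    of points stands in for that here.)"""
    step = width // (n + 1)
    return {f"P{i + 1}": (step * (i + 1), height // 2) for i in range(n)}

def build_vqa_prompt(task, marks):
    """Phrase keypoint selection as a visual question: the VLM sees the
    annotated image and answers with one of the mark labels."""
    labels = ", ".join(sorted(marks))
    return (f"The image is annotated with candidate points {labels}. "
            f"For the task '{task}', which point should the gripper "
            f"grasp? Answer with a single label.")

def decode_answer(answer, marks):
    """Map the VLM's chosen label back to a pixel coordinate."""
    label = answer.strip()
    if label not in marks:
        raise ValueError(f"unknown mark label: {label}")
    return marks[label]

marks = propose_candidate_points(640, 480, n=3)
prompt = build_vqa_prompt("pick up the sponge", marks)
grasp_xy = decode_answer("P2", marks)  # "P2" stands in for a real VLM reply
```

The point of the indirection is that the VLM only ever emits a discrete label, which is easy to parse reliably, while the mapping from labels to image coordinates (and from there to robot motion) stays on the robot side.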
Related papers
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains.
We propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Vision-Language Foundation Models as Effective Robot Imitators [48.73027330407576]
We derive a vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLMs, OpenFlamingo.
By exceeding state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control.
arXiv Detail & Related papers (2023-11-02T16:34:33Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
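The VoxPoser entry above describes composing 3D value maps to synthesize manipulation trajectories. The following is a deliberately simplified sketch of that idea on a small voxel grid; the attraction/repulsion maps and the greedy planner are stand-ins I chose for illustration, not the paper's actual method (which grounds LLM-written value maps with perception and uses a proper motion planner).

```python
# Simplified sketch of composable voxel value maps: build per-objective
# cost maps, compose them by weighted sum, then greedily descend the
# combined cost from a start voxel to produce a waypoint path.

def make_map(shape, fn):
    """Build a 3D value map by evaluating a cost function at every voxel."""
    X, Y, Z = shape
    return {(x, y, z): fn(x, y, z)
            for x in range(X) for y in range(Y) for z in range(Z)}

def compose(maps, weights):
    """Compose several value maps into one by a weighted sum."""
    return {k: sum(w * m[k] for m, w in zip(maps, weights))
            for k in maps[0]}

def greedy_path(value_map, start, steps=20):
    """Repeatedly step to the lowest-cost neighboring voxel until
    a local minimum is reached or the step budget runs out."""
    path, cur = [start], start
    for _ in range(steps):
        x, y, z = cur
        neighbors = [(x + dx, y + dy, z + dz)
                     for dx in (-1, 0, 1)
                     for dy in (-1, 0, 1)
                     for dz in (-1, 0, 1)
                     if (x + dx, y + dy, z + dz) in value_map]
        nxt = min(neighbors, key=value_map.get)
        if value_map[nxt] >= value_map[cur]:
            break  # local minimum: no neighbor improves the cost
        path.append(nxt)
        cur = nxt
    return path

shape, goal, obstacle = (6, 6, 6), (5, 5, 5), (3, 3, 3)
# One map attracts toward the goal; a second adds a repulsive bump
# around the obstacle (within Chebyshev distance 1 of it).
attract = make_map(shape, lambda x, y, z:
                   abs(x - goal[0]) + abs(y - goal[1]) + abs(z - goal[2]))
avoid = make_map(shape, lambda x, y, z:
                 5.0 if max(abs(x - obstacle[0]), abs(y - obstacle[1]),
                            abs(z - obstacle[2])) <= 1 else 0.0)
combined = compose([attract, avoid], [1.0, 1.0])
path = greedy_path(combined, start=(0, 0, 0))
```

The composition step is what makes the representation flexible: new constraints ("stay away from the mug", "keep the gripper low") become additional maps and weights rather than changes to the planner.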
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.