MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual
Prompting
- URL: http://arxiv.org/abs/2403.03174v1
- Date: Tue, 5 Mar 2024 18:08:45 GMT
- Title: MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual
Prompting
- Authors: Fangchen Liu, Kuan Fang, Pieter Abbeel, Sergey Levine
- Abstract summary: We present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that employs vision language models to solve robotic manipulation tasks.
At the heart of our approach is a compact point-based representation of affordance and motion that bridges the VLM's predictions on RGB images and the robot's motions in the physical world.
We evaluate and analyze MOKA's performance on a variety of manipulation tasks specified by free-form language descriptions.
- Score: 106.53784213239479
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary generalization requires robotic systems to perform tasks
involving complex and diverse environments and task goals. While the recent
advances in vision language models (VLMs) present unprecedented opportunities
to solve unseen problems, how to utilize their emergent capabilities to control
robots in the physical world remains an open question. In this paper, we
present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that
employs VLMs to solve robotic manipulation tasks specified by free-form
language descriptions. At the heart of our approach is a compact point-based
representation of affordance and motion that bridges the VLM's predictions on
RGB images and the robot's motions in the physical world. By prompting a VLM
pre-trained on Internet-scale data, our approach predicts the affordances and
generates the corresponding motions by leveraging the concept understanding and
commonsense knowledge from broad sources. To scaffold the VLM's reasoning in
zero-shot, we propose a visual prompting technique that annotates marks on the
images, converting the prediction of keypoints and waypoints into a series of
visual question answering problems that are feasible for the VLM to solve.
Using the robot experiences collected in this way, we further investigate ways
to bootstrap the performance through in-context learning and policy
distillation. We evaluate and analyze MOKA's performance on a variety of
manipulation tasks specified by free-form language descriptions, such as tool
use, deformable body manipulation, and object rearrangement.
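The abstract describes marks drawn on an RGB image, a VLM that answers by choosing among those marks, and a point-based representation that connects the chosen points to robot motion. The sketch below illustrates only that general idea and is not MOKA's implementation: the JSON answer format, the role names (`grasp_keypoint`, `target_keypoint`), the helper names, and the use of a pinhole camera model with a depth reading are all assumptions made for illustration.

```python
import json


def resolve_marks(vlm_answer, marks):
    """Map the mark labels chosen by a VLM back to pixel coordinates.

    vlm_answer: JSON string such as '{"grasp_keypoint": "P2"}' (hypothetical format).
    marks: dict from mark label to the (u, v) pixel where that mark was drawn.
    """
    chosen = json.loads(vlm_answer)
    resolved = {}
    for role, label in chosen.items():
        if label is None:
            # The VLM may decide that a role does not apply to the task.
            resolved[role] = None
        elif label not in marks:
            raise ValueError(f"VLM chose unknown mark {label!r} for {role!r}")
        else:
            resolved[role] = marks[label]
    return resolved


def deproject(pixel, depth_m, fx, fy, cx, cy):
    """Lift a pixel to a 3D point in the camera frame (pinhole model, assumed)."""
    u, v = pixel
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical candidate marks overlaid on the image, keyed by their labels.
marks = {"P1": (320, 240), "P2": (400, 260), "P3": (150, 300)}

# Hypothetical VLM reply to a visual question about the marked image.
answer = '{"grasp_keypoint": "P2", "target_keypoint": "P3"}'

keypoints = resolve_marks(answer, marks)
grasp_xyz = deproject(keypoints["grasp_keypoint"], depth_m=0.5,
                      fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(keypoints)
print(grasp_xyz)
```

Restricting the answer to a fixed set of labels is what turns keypoint prediction into a question the VLM can answer by selection; an unknown label is rejected instead of being guessed at.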
Related papers
- Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation [90.00687889213991]
Solving complex long-horizon robotic manipulation problems requires sophisticated high-level planning capabilities.
Vision-language models (VLMs) pretrained on Internet data could in principle offer a framework for tackling such problems.
In this paper, we introduce a novel test-time framework that enhances VLMs' physical reasoning capabilities for multi-stage manipulation tasks.
arXiv Detail & Related papers (2025-02-23T20:42:15Z)
- VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model [4.557035895252272]
Vision Language Models (VLMs) have been adopted in robotics for their capability in common sense reasoning and generalizability.
In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning.
We named it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do".
arXiv Detail & Related papers (2024-10-11T13:17:52Z)
- Solving Robotics Problems in Zero-Shot with Vision-Language Models [0.0]
We introduce Wonderful Team, a multi-agent Vision Large Language Model (VLLM) framework designed to solve robotics problems in a zero-shot regime.
In our context, zero-shot means that for a novel environment, we provide a VLLM with an image of the robot's surroundings and a task description.
Our system showcases the ability to handle diverse tasks such as manipulation, goal-reaching, and visual reasoning -- all in a zero-shot manner.
arXiv Detail & Related papers (2024-07-26T21:18:57Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM finetuned with a limited amount of such datasets can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- QUAR-VLA: Vision-Language-Action Model for Quadruped Robots [37.952398683031895]
The central idea is to elevate the overall intelligence of the robot.
We propose QUAdruped Robotic Transformer (QUART), a family of VLA models to integrate visual information and instructions from diverse modalities as input.
Our approach leads to performant robotic policies and enables QUART to obtain a range of emergent capabilities.
arXiv Detail & Related papers (2023-12-22T06:15:03Z)
- Vision-Language Foundation Models as Effective Robot Imitators [48.73027330407576]
We derive a vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo.
By exceeding the state-of-the-art performance by a large margin on the tested benchmark, we show RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control.
arXiv Detail & Related papers (2023-11-02T16:34:33Z)
- Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observations.
arXiv Detail & Related papers (2023-05-30T09:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.