OK-Robot: What Really Matters in Integrating Open-Knowledge Models for
Robotics
- URL: http://arxiv.org/abs/2401.12202v2
- Date: Thu, 29 Feb 2024 17:20:08 GMT
- Title: OK-Robot: What Really Matters in Integrating Open-Knowledge Models for
Robotics
- Authors: Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi
Shafiullah, Lerrel Pinto
- Abstract summary: We develop a new Open Knowledge-based robotics framework called OK-Robot.
By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training.
Results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks.
- Score: 26.73838656137223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remarkable progress has been made in recent years in the fields of vision,
language, and robotics. We now have vision models capable of recognizing
objects based on language queries, navigation systems that can effectively
control mobile systems, and grasping models that can handle a wide range of
objects. Despite these advancements, general-purpose applications of robotics
still lag behind, even though they rely on these fundamental capabilities of
recognition, navigation, and grasping. In this paper, we adopt a systems-first
approach to develop a new Open Knowledge-based robotics framework called
OK-Robot. By combining Vision-Language Models (VLMs) for object detection,
navigation primitives for movement, and grasping primitives for object
manipulation, OK-Robot offers an integrated solution for pick-and-drop
operations without requiring any training. To evaluate its performance, we run
OK-Robot in 10 real-world home environments. The results demonstrate that
OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks,
representing a new state-of-the-art in Open Vocabulary Mobile Manipulation
(OVMM) with nearly 1.8x the performance of prior work. On cleaner, uncluttered
environments, OK-Robot's performance increases to 82%. However, the most
important insight gained from OK-Robot is the critical role of nuanced details
when combining Open Knowledge systems like VLMs with robotic modules. Videos of
our experiments and code are available on our website:
https://ok-robot.github.io
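
The abstract describes OK-Robot as a training-free composition of three open-knowledge modules: open-vocabulary detection via VLMs, navigation primitives, and grasping primitives. The sketch below illustrates how such a pick-and-drop pipeline could be wired together; all class names, method signatures, and the stub behaviors are illustrative assumptions for exposition, not the actual OK-Robot API.

```python
# Minimal sketch of a modular, training-free pick-and-drop pipeline in the
# spirit of OK-Robot. Every name below is an illustrative assumption.
from typing import Tuple

Pose = Tuple[float, float, float]  # (x, y, theta) in the map frame


class SemanticMemory:
    """Stand-in for a VLM-backed map that answers open-vocabulary queries."""

    def localize(self, query: str) -> Pose:
        # A real system would match the language query against VLM features
        # stored in a 3D scene representation; here we return a dummy pose.
        print(f"[memory] localizing '{query}'")
        return (1.0, 2.0, 0.0)


class Navigator:
    """Stand-in for a point-goal navigation primitive."""

    def go_to(self, target: Pose) -> bool:
        print(f"[nav] driving to {target}")
        return True


class Grasper:
    """Stand-in for a pretrained grasping primitive."""

    def pick(self, object_query: str) -> bool:
        print(f"[grasp] picking '{object_query}'")
        return True

    def place(self) -> bool:
        print("[grasp] releasing object")
        return True


def pick_and_drop(memory: SemanticMemory, nav: Navigator, arm: Grasper,
                  object_query: str, receptacle_query: str) -> bool:
    """Compose the three open-knowledge modules into one behavior."""
    if not nav.go_to(memory.localize(object_query)):       # 1. go to the object
        return False
    if not arm.pick(object_query):                          # 2. grasp it
        return False
    if not nav.go_to(memory.localize(receptacle_query)):    # 3. go to the receptacle
        return False
    return arm.place()                                      # 4. drop it


if __name__ == "__main__":
    pick_and_drop(SemanticMemory(), Navigator(), Grasper(),
                  "a can of soda", "the recycling bin")
```

The point of the sketch is the systems-first composition the paper emphasizes: each module is swappable, and the overall success rate is the product of per-module reliability, which is why the paper stresses nuanced integration details.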
Related papers
- LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning [50.99807031490589]
We introduce LLARVA, a model trained with a novel instruction tuning method to unify a range of robotic learning tasks, scenarios, and environments.
We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model.
Experiments demonstrate that LLARVA performs well compared to several contemporary baselines.
arXiv Detail & Related papers (2024-06-17T17:55:29Z) - RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z) - Octo: An Open-Source Generalist Robot Policy [88.14295917143188]
We introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
It can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on a standard consumer GPU.
We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
arXiv Detail & Related papers (2024-05-20T17:57:01Z) - OpenBot-Fleet: A System for Collective Learning with Real Robots [45.739144410591805]
We introduce OpenBot-Fleet, a comprehensive open-source cloud robotics system for navigation.
OpenBot-Fleet uses smartphones for sensing, local compute, and communication, and Google services for secure cloud storage and off-board compute.
In experiments we distribute 72 robots to a crowd of workers who operate them in homes, and show that OpenBot-Fleet can learn robust navigation policies.
arXiv Detail & Related papers (2024-05-13T07:22:50Z) - Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, a novel end-to-end video-based learning framework for robots.
Given a video demonstration of a manipulation task and current visual observations, Vid2Robot directly produces robot actions.
This is achieved through a unified representation model trained on a large dataset of human videos and robot trajectories.
arXiv Detail & Related papers (2024-03-19T17:47:37Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real
and Simulation [77.41969287400977]
This paper presents RoboScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - HomeRobot: Open-Vocabulary Mobile Manipulation [107.05702777141178]
Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location.
HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch.
arXiv Detail & Related papers (2023-06-20T14:30:32Z) - Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)