OK-Robot: What Really Matters in Integrating Open-Knowledge Models for
Robotics
- URL: http://arxiv.org/abs/2401.12202v2
- Date: Thu, 29 Feb 2024 17:20:08 GMT
- Title: OK-Robot: What Really Matters in Integrating Open-Knowledge Models for
Robotics
- Authors: Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi
Shafiullah, Lerrel Pinto
- Abstract summary: We develop a new Open Knowledge-based robotics framework called OK-Robot.
By combining Vision-Language Models (VLMs) for object detection, navigation primitives for movement, and grasping primitives for object manipulation, OK-Robot offers an integrated solution for pick-and-drop operations without requiring any training.
Results demonstrate that OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks.
- Score: 26.73838656137223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Remarkable progress has been made in recent years in the fields of vision,
language, and robotics. We now have vision models capable of recognizing
objects based on language queries, navigation systems that can effectively
control mobile systems, and grasping models that can handle a wide range of
objects. Despite these advancements, general-purpose applications of robotics
still lag behind, even though they rely on these fundamental capabilities of
recognition, navigation, and grasping. In this paper, we adopt a systems-first
approach to develop a new Open Knowledge-based robotics framework called
OK-Robot. By combining Vision-Language Models (VLMs) for object detection,
navigation primitives for movement, and grasping primitives for object
manipulation, OK-Robot offers an integrated solution for pick-and-drop
operations without requiring any training. To evaluate its performance, we run
OK-Robot in 10 real-world home environments. The results demonstrate that
OK-Robot achieves a 58.5% success rate in open-ended pick-and-drop tasks,
representing a new state-of-the-art in Open Vocabulary Mobile Manipulation
(OVMM) with nearly 1.8x the performance of prior work. In cleaner, uncluttered
environments, OK-Robot's performance increases to 82%. However, the most
important insight gained from OK-Robot is the critical role of nuanced details
when combining Open Knowledge systems like VLMs with robotic modules. Videos of
our experiments and code are available on our website:
https://ok-robot.github.io
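
As a rough, illustrative sketch of the modular, training-free design described in the abstract, the Python snippet below wires an open-vocabulary detector, a navigation primitive, and a grasping primitive into a single pick-and-drop call. All class and function names here (OpenVocabDetector, Navigator, Grasper, pick_and_drop) are hypothetical placeholders for this summary, not the interfaces released at https://ok-robot.github.io.

```python
# Minimal sketch of a modular, training-free pick-and-drop pipeline in the
# spirit of OK-Robot. All names are placeholders for illustration only.
from dataclasses import dataclass
from typing import Optional, Tuple

Pose = Tuple[float, float, float]  # x, y, theta in the room frame


@dataclass
class Detection:
    label: str
    pose: Pose          # approximate location of the detected object
    confidence: float


class OpenVocabDetector:
    """Stand-in for a VLM-based open-vocabulary detector over a scanned scene."""

    def __init__(self, scene_memory: dict):
        self.scene_memory = scene_memory  # label -> (pose, confidence)

    def locate(self, query: str) -> Optional[Detection]:
        hit = self.scene_memory.get(query)
        if hit is None:
            return None
        pose, conf = hit
        return Detection(label=query, pose=pose, confidence=conf)


class Navigator:
    """Stand-in for a navigation primitive (e.g. map-based point-goal navigation)."""

    def go_to(self, pose: Pose) -> bool:
        print(f"navigating to {pose}")
        return True


class Grasper:
    """Stand-in for a pretrained grasping primitive."""

    def pick(self, detection: Detection) -> bool:
        print(f"grasping {detection.label}")
        return True

    def drop(self) -> bool:
        print("releasing object")
        return True


def pick_and_drop(obj_query: str, dest_query: str,
                  detector: OpenVocabDetector, nav: Navigator, grasp: Grasper) -> bool:
    """Pick the queried object and drop it at the queried destination."""
    obj = detector.locate(obj_query)
    dest = detector.locate(dest_query)
    if obj is None or dest is None:
        return False
    return (nav.go_to(obj.pose) and grasp.pick(obj)
            and nav.go_to(dest.pose) and grasp.drop())


if __name__ == "__main__":
    memory = {"coffee mug": ((1.0, 2.0, 0.0), 0.9),
              "kitchen sink": ((3.5, 0.5, 1.6), 0.8)}
    ok = pick_and_drop("coffee mug", "kitchen sink",
                       OpenVocabDetector(memory), Navigator(), Grasper())
    print("success:", ok)
```

The point this structure mirrors is that each module is used off the shelf; the integration logic, rather than any learned glue, is what the training-free pipeline contributes.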
Related papers
- $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people, and its ability to acquire new skills via fine-tuning.
arXiv Detail & Related papers (2024-10-31T17:22:30Z)
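
The $π_0$ entry above is built around flow matching for action generation. As a generic refresher under stated assumptions (a simple MLP velocity network, a linear interpolation path, and made-up action/context dimensions), and not the $π_0$ architecture itself, here is a minimal conditional flow-matching training step in PyTorch:

```python
# Generic conditional flow-matching training step (illustrative, not the pi_0 code).
import torch
import torch.nn as nn

ACTION_DIM, COND_DIM = 7, 512  # assumed sizes for illustration

# Velocity network: predicts d/dt of the action given (noisy action, time, context).
velocity_net = nn.Sequential(
    nn.Linear(ACTION_DIM + 1 + COND_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)


def flow_matching_step(actions: torch.Tensor, context: torch.Tensor) -> float:
    """One training step: regress onto the straight-line velocity (actions - noise)."""
    batch = actions.shape[0]
    noise = torch.randn_like(actions)
    t = torch.rand(batch, 1)                 # random time in [0, 1]
    x_t = (1 - t) * noise + t * actions      # point on the linear interpolation path
    target_velocity = actions - noise        # constant velocity along that path
    pred = velocity_net(torch.cat([x_t, t, context], dim=-1))
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Dummy batch: random "actions" conditioned on random VLM embeddings.
loss = flow_matching_step(torch.randn(32, ACTION_DIM), torch.randn(32, COND_DIM))
print(f"flow-matching loss: {loss:.4f}")
```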
- Open-TeleVision: Teleoperation with Immersive Active Visual Feedback [17.505318269362512]
Open-TeleVision allows operators to actively perceive the robot's surroundings in a stereoscopic manner.
The system mirrors the operator's arm and hand movements on the robot, creating an immersive experience.
We validate the effectiveness of our system by collecting data and training imitation learning policies on four long-horizon, precise tasks.
arXiv Detail & Related papers (2024-07-01T17:55:35Z)
- RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics [46.63773228934993]
We introduce an automatic synthetic data generation pipeline that instruction-tunes vision language models (VLMs) to robotic domains and needs.
Using the pipeline, we train RoboPoint, a VLM that predicts image keypoint affordances given language instructions.
Our experiments demonstrate that RoboPoint outperforms state-of-the-art VLMs by 21.8% in the accuracy of predicting spatial affordance and by 30.5% in the success rate of downstream tasks.
arXiv Detail & Related papers (2024-06-15T19:22:51Z)
- Octo: An Open-Source Generalist Robot Policy [88.14295917143188]
We introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset.
It can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs.
We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
arXiv Detail & Related papers (2024-05-20T17:57:01Z)
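
The Octo entry above emphasizes finetuning a pretrained generalist policy to new sensory inputs and action spaces. The sketch below shows one common pattern for that kind of adaptation, freezing a pretrained trunk and training a fresh action head by behavior cloning; the modules, sizes, and frozen-trunk choice are assumptions for illustration, not Octo's released API or recipe.

```python
# Generic pattern for adapting a pretrained policy trunk to a new action space
# (illustrative placeholder, not Octo's released API).
import torch
import torch.nn as nn

EMBED_DIM, NEW_ACTION_DIM = 256, 6  # assumed sizes for illustration

# Stand-in for a pretrained transformer trunk mapping observations to embeddings.
pretrained_trunk = nn.Sequential(nn.Linear(128, EMBED_DIM), nn.ReLU(),
                                 nn.Linear(EMBED_DIM, EMBED_DIM))
for param in pretrained_trunk.parameters():
    param.requires_grad = False  # keep the pretrained weights frozen (one common choice)

# Fresh action head for the new robot's action space; only this part is trained.
new_action_head = nn.Linear(EMBED_DIM, NEW_ACTION_DIM)
optimizer = torch.optim.Adam(new_action_head.parameters(), lr=3e-4)


def finetune_step(observations: torch.Tensor, target_actions: torch.Tensor) -> float:
    """Behavior-cloning step on a small dataset from the new robot setup."""
    with torch.no_grad():
        features = pretrained_trunk(observations)
    loss = nn.functional.mse_loss(new_action_head(features), target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Dummy batch standing in for a few demonstrations on the new embodiment.
print(finetune_step(torch.randn(16, 128), torch.randn(16, NEW_ACTION_DIM)))
```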
- OpenBot-Fleet: A System for Collective Learning with Real Robots [45.739144410591805]
We introduce OpenBot-Fleet, a comprehensive open-source cloud robotics system for navigation.
OpenBot-Fleet uses smartphones for sensing, local compute, and communication, and Google for secure cloud storage and off-board compute.
In experiments we distribute 72 robots to a crowd of workers who operate them in homes, and show that OpenBot-Fleet can learn robust navigation policies.
arXiv Detail & Related papers (2024-05-13T07:22:50Z)
- RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z)
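
The RoboScript entry above centers on turning free-form language into executable manipulation code. As a generic illustration (not the RoboScript platform or its interfaces), the snippet below runs a canned "generated" program against a stub robot API; in a real pipeline the program string would come from a code-generation model and execution would be sandboxed and checked.

```python
# Generic sketch of executing model-generated manipulation code against a stub
# robot API (illustrative only; not the RoboScript platform or its interfaces).


class StubArm:
    """Placeholder robot interface exposed to the generated program."""

    def move_to(self, name: str) -> None:
        print(f"moving above '{name}'")

    def grasp(self, name: str) -> None:
        print(f"grasping '{name}'")

    def release_over(self, name: str) -> None:
        print(f"releasing over '{name}'")


def generate_code(instruction: str) -> str:
    """Stand-in for a code-generation model; returns a canned program here."""
    return (
        "arm.move_to('red block')\n"
        "arm.grasp('red block')\n"
        "arm.move_to('blue bowl')\n"
        "arm.release_over('blue bowl')\n"
    )


def run_instruction(instruction: str) -> None:
    program = generate_code(instruction)
    # Restrict the namespace so the generated code can only touch the robot API.
    exec(program, {"__builtins__": {}}, {"arm": StubArm()})


run_instruction("put the red block in the blue bowl")
```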
- HomeRobot: Open-Vocabulary Mobile Manipulation [107.05702777141178]
Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location.
HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch.
arXiv Detail & Related papers (2023-06-20T14:30:32Z) - Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.