KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real
Transfer for Robotics Manipulation
- URL: http://arxiv.org/abs/2007.13960v1
- Date: Tue, 28 Jul 2020 02:53:28 GMT
- Title: KOVIS: Keypoint-based Visual Servoing with Zero-Shot Sim-to-Real
Transfer for Robotics Manipulation
- Authors: En Yen Puang and Keng Peng Tee and Wei Jing
- Abstract summary: KOVIS is a learning-based, calibration-free visual servoing method for fine robotic manipulation tasks with an eye-in-hand stereo camera system.
We train the deep neural network only in the simulated environment.
We demonstrate the effectiveness of the proposed method in both simulated environments and real-world experiments with different robotic manipulation tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present KOVIS, a novel learning-based, calibration-free visual servoing
method for fine robotic manipulation tasks with an eye-in-hand stereo camera
system. We train the deep neural network only in the simulated environment, and
the trained model can be used directly for real-world visual servoing tasks.
KOVIS consists of two networks. The first, a keypoint network, learns a keypoint
representation from the image using an autoencoder. The second, a visual
servoing network, learns the motion based on keypoints extracted from the camera
image. The two networks are trained end-to-end in the simulated environment by
self-supervised learning without manual data labeling. After training with data
augmentation, domain randomization, and adversarial examples, we are able to
achieve zero-shot sim-to-real transfer to real-world robotic manipulation
tasks. We demonstrate the effectiveness of the proposed method in both
simulated environments and real-world experiments on different robotic
manipulation tasks, including grasping, peg-in-hole insertion with 4 mm
clearance, and M13 screw insertion. The demo video is available at
http://youtu.be/gfBJBR2tDzA
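
The sketch below illustrates, in PyTorch, how the two-network pipeline described in the abstract could be wired together: a keypoint encoder that turns each stereo view into 2D keypoints through a soft-argmax bottleneck, and a visual servoing network that maps the stereo keypoints to a motion command. This is a minimal sketch based only on the abstract; the layer sizes, the number of keypoints, and the names `KeypointEncoder`, `ServoNet`, and `NUM_KEYPOINTS` are illustrative assumptions rather than the authors' implementation, and the reconstruction decoder used for self-supervised training is omitted.

```python
# Minimal sketch of a KOVIS-style two-network pipeline (assumed architecture,
# not the authors' released code). The decoder for autoencoder training is omitted.
import torch
import torch.nn as nn

NUM_KEYPOINTS = 8   # assumed number of keypoints per view
IMG_CHANNELS = 1    # assumed grayscale stereo input


class KeypointEncoder(nn.Module):
    """Keypoint network: encodes one camera view into 2D keypoint coordinates.

    Trained inside an autoencoder so the keypoints form a bottleneck
    representation, without manual labels.
    """

    def __init__(self, num_keypoints: int = NUM_KEYPOINTS):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(IMG_CHANNELS, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_keypoints, 3, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        heatmaps = self.backbone(image)                        # (B, K, H, W)
        b, k, h, w = heatmaps.shape
        probs = heatmaps.flatten(2).softmax(dim=-1).view(b, k, h, w)
        # Soft-argmax: expected (x, y) location of each keypoint heatmap.
        ys = torch.linspace(-1.0, 1.0, h, device=image.device)
        xs = torch.linspace(-1.0, 1.0, w, device=image.device)
        kp_y = (probs.sum(dim=3) * ys).sum(dim=2)
        kp_x = (probs.sum(dim=2) * xs).sum(dim=2)
        return torch.stack([kp_x, kp_y], dim=-1)               # (B, K, 2)


class ServoNet(nn.Module):
    """Visual servoing network: maps stereo keypoints to a motion command."""

    def __init__(self, num_keypoints: int = NUM_KEYPOINTS, action_dim: int = 6):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, action_dim),                        # e.g. a 6-DoF twist
        )

    def forward(self, kp_left: torch.Tensor, kp_right: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([kp_left.flatten(1), kp_right.flatten(1)], dim=1)
        return self.mlp(feats)


# Example forward pass with a dummy stereo pair.
encoder, servo = KeypointEncoder(), ServoNet()
left = torch.randn(1, IMG_CHANNELS, 96, 96)
right = torch.randn(1, IMG_CHANNELS, 96, 96)
action = servo(encoder(left), encoder(right))
print(action.shape)  # torch.Size([1, 6])
```

Because the soft-argmax keeps keypoint extraction differentiable, gradients can flow from the servoing output back into the encoder, which is consistent with the end-to-end, self-supervised training described in the abstract.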
Related papers
- KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data [45.25288643161976]
We propose Keypoint Affordance Learning from Imagined Environments (KALIE) for robotic control in a scalable manner.
Instead of directly producing motor commands, KALIE controls the robot by predicting point-based affordance representations.
We demonstrate that KALIE can learn to robustly solve new manipulation tasks with unseen objects given only 50 example data points.
arXiv Detail & Related papers (2024-09-21T08:45:16Z) - Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z) - Markerless Camera-to-Robot Pose Estimation via Self-supervised
Sim-to-Real Transfer [26.21320177775571]
We propose an end-to-end pose estimation framework capable of online camera-to-robot calibration, together with a self-supervised training method.
Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable.
arXiv Detail & Related papers (2023-02-28T05:55:42Z) - Real-World Robot Learning with Masked Visual Pre-training [161.88981509645416]
In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks.
Our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and then passed into a learnable control module.
We find that our encoder consistently outperforms CLIP (up to 75%), supervised ImageNet pre-training (up to 81%), and training from scratch (up to 81%).
arXiv Detail & Related papers (2022-10-06T17:59:01Z) - Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z) - Graph Neural Networks for Relational Inductive Bias in Vision-based Deep
Reinforcement Learning of Robot Control [0.0]
This work introduces a neural network architecture that combines relational inductive bias and visual feedback to learn an efficient position control policy.
We derive a graph representation that models the robot's internal state with a low-dimensional description of the visual scene generated by an image encoding network.
We show the ability of the model to improve sample efficiency for a 6-DoF robot arm in a visually realistic 3D environment.
arXiv Detail & Related papers (2022-03-11T15:11:54Z) - End-to-end Reinforcement Learning of Robotic Manipulation with Robust
Keypoints Representation [7.374994747693731]
We present an end-to-end Reinforcement Learning framework for robotic manipulation tasks, using a robust and efficient keypoints representation.
The proposed method learns keypoints from camera images as the state representation, through a self-supervised autoencoder architecture.
We demonstrate the effectiveness of the proposed method on robotic manipulation tasks including grasping and pushing, in different scenarios.
arXiv Detail & Related papers (2022-02-12T09:58:09Z) - Where is my hand? Deep hand segmentation for visual self-recognition in
humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.