Look Closer: Bridging Egocentric and Third-Person Views with
Transformers for Robotic Manipulation
- URL: http://arxiv.org/abs/2201.07779v2
- Date: Thu, 20 Jan 2022 10:12:14 GMT
- Title: Look Closer: Bridging Egocentric and Third-Person Views with
Transformers for Robotic Manipulation
- Authors: Rishabh Jangir, Nicklas Hansen, Sambaran Ghosal, Mohit Jain, Xiaolong
Wang
- Abstract summary: Learning to solve precision-based manipulation tasks from visual feedback could drastically reduce the engineering efforts required by traditional robot systems.
We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist.
To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism.
- Score: 15.632809977544907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to solve precision-based manipulation tasks from visual feedback
using Reinforcement Learning (RL) could drastically reduce the engineering
efforts required by traditional robot systems. However, performing fine-grained
motor control from visual inputs alone is challenging, especially with a static
third-person camera as often used in previous work. We propose a setting for
robotic manipulation in which the agent receives visual feedback from both a
third-person camera and an egocentric camera mounted on the robot's wrist.
While the third-person camera is static, the egocentric camera enables the
robot to actively control its vision to aid in precise manipulation. To fuse
visual information from both cameras effectively, we additionally propose to
use Transformers with a cross-view attention mechanism that models spatial
attention from one view to another (and vice-versa), and use the learned
features as input to an RL policy. Our method improves learning over strong
single-view and multi-view baselines, and successfully transfers to a set of
challenging manipulation tasks on a real robot with uncalibrated cameras, no
access to state information, and a high degree of task variability. In a hammer
manipulation task, our method succeeds in 75% of trials versus 38% and 13% for
multi-view and single-view baselines, respectively.
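The cross-view fusion described in the abstract can be pictured with a minimal sketch: queries from one camera's patch features attend over the other camera's patches (and vice-versa), and the fused features are pooled into the input of the RL policy. The module, layer sizes, and tensor shapes below are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (assumed layer sizes and names) of cross-view attention between
# a third-person view and an egocentric wrist view; not the authors' exact code.
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # One attention block per direction: third-person -> egocentric and back.
        self.third_to_ego = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ego_to_third = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, third: torch.Tensor, ego: torch.Tensor) -> torch.Tensor:
        # third, ego: (batch, num_patches, dim) patch features from the two encoders.
        # Queries from one view attend over keys/values from the other view.
        third_fused, _ = self.third_to_ego(query=third, key=ego, value=ego)
        ego_fused, _ = self.ego_to_third(query=ego, key=third, value=third)
        # Residual connections, then pool both views into one policy input vector.
        fused = torch.cat([self.norm(third + third_fused),
                           self.norm(ego + ego_fused)], dim=1)
        return fused.mean(dim=1)  # (batch, dim), fed to the RL policy head


# Example with dummy 7x7 patch grids from two uncalibrated cameras:
# fuse = CrossViewAttention()
# z = fuse(torch.randn(8, 49, 128), torch.randn(8, 49, 128))  # -> (8, 128)
```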
Related papers
- Open-TeleVision: Teleoperation with Immersive Active Visual Feedback [17.505318269362512]
Open-TeleVision allows operators to actively perceive the robot's surroundings in a stereoscopic manner.
The system mirrors the operator's arm and hand movements on the robot, creating an immersive experience.
We validate the effectiveness of our system by collecting data and training imitation learning policies on four long-horizon, precise tasks.
arXiv Detail & Related papers (2024-07-01T17:55:35Z)
- Vision-based Manipulation from Single Human Video with Open-World Object Graphs [58.23098483464538]
We present an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.
We introduce ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video.
arXiv Detail & Related papers (2024-05-30T17:56:54Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Multi-View Masked World Models for Visual Robotic Manipulation [132.97980128530017]
We train a multi-view masked autoencoder which reconstructs pixels of randomly masked viewpoints (a minimal masking sketch follows this list).
We demonstrate the effectiveness of our method in a range of scenarios.
We also show that the multi-view masked autoencoder trained with multiple randomized viewpoints enables training a policy with strong viewpoint randomization.
arXiv Detail & Related papers (2023-02-05T15:37:02Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- From One Hand to Multiple Hands: Imitation Learning for Dexterous Manipulation from Single-Camera Teleoperation [26.738893736520364]
We introduce a novel single-camera teleoperation system to collect the 3D demonstrations efficiently with only an iPad and a computer.
We construct a customized robot hand for each user in the physical simulator: a manipulator that resembles the kinematic structure and shape of the operator's hand.
With imitation learning using our data, we show large improvement over baselines with multiple complex manipulation tasks.
arXiv Detail & Related papers (2022-04-26T17:59:51Z)
- Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly [18.563562557565483]
We propose the framework "Scene Editing as Teleoperation" (SEaT).
Instead of controlling the robot, users focus on specifying the task's goal.
A user can perform teleoperation without any expert knowledge of the robot hardware.
arXiv Detail & Related papers (2021-10-09T04:22:21Z)
- Morphology-Agnostic Visual Robotic Control [76.44045983428701]
MAVRIC is an approach that works with minimal prior knowledge of the robot's morphology.
We demonstrate our method on visually-guided 3D point reaching, trajectory following, and robot-to-robot imitation.
arXiv Detail & Related papers (2019-12-31T15:45:10Z)
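For the "Multi-View Masked World Models" entry above, the viewpoint-masking idea can be sketched as follows. The function name, tensor shapes, and mask ratio are illustrative assumptions, not that paper's implementation.

```python
# Hypothetical sketch of viewpoint-level random masking for a multi-view masked
# autoencoder; names and shapes are assumptions for illustration only.
import torch


def mask_random_viewpoints(frames: torch.Tensor, mask_ratio: float = 0.5):
    """frames: (batch, num_views, channels, height, width) images from several cameras.

    Returns the visible frames and a boolean mask marking which viewpoints were
    dropped; the autoencoder would then be trained to reconstruct the pixels of
    the masked viewpoints from the visible ones.
    """
    batch, num_views = frames.shape[:2]
    num_masked = int(num_views * mask_ratio)
    # Random score per viewpoint; mask the num_masked lowest-ranked viewpoints.
    scores = torch.rand(batch, num_views, device=frames.device)
    ranks = scores.argsort(dim=1).argsort(dim=1)           # rank of each viewpoint
    masked = ranks < num_masked                            # (batch, num_views) bool
    visible = frames * (~masked)[:, :, None, None, None]   # zero out masked views
    return visible, masked


# Usage with dummy data: 2 of 4 viewpoints masked per sample.
# vis, m = mask_random_viewpoints(torch.randn(8, 4, 3, 64, 64), mask_ratio=0.5)
```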
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.