Vision-Based Manipulators Need to Also See from Their Hands
- URL: http://arxiv.org/abs/2203.12677v1
- Date: Tue, 15 Mar 2022 18:46:18 GMT
- Title: Vision-Based Manipulators Need to Also See from Their Hands
- Authors: Kyle Hsu, Moo Jin Kim, Rafael Rafailov, Jiajun Wu, Chelsea Finn
- Abstract summary: We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations.
We find that a hand-centric (eye-in-hand) perspective affords reduced observability, but it consistently improves training efficiency and out-of-distribution generalization.
- Score: 58.398637422321976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how the choice of visual perspective affects learning and
generalization in the context of physical manipulation from raw sensor
observations. Compared with the more commonly used global third-person
perspective, a hand-centric (eye-in-hand) perspective affords reduced
observability, but we find that it consistently improves training efficiency
and out-of-distribution generalization. These benefits hold across a variety of
learning algorithms, experimental settings, and distribution shifts, and for
both simulated and real robot apparatuses. However, this is only the case when
hand-centric observability is sufficient; otherwise, including a third-person
perspective is necessary for learning, but also harms out-of-distribution
generalization. To mitigate this, we propose to regularize the third-person
information stream via a variational information bottleneck. On six
representative manipulation tasks with varying hand-centric observability
adapted from the Meta-World benchmark, this results in a state-of-the-art
reinforcement learning agent operating from both perspectives improving its
out-of-distribution generalization on every task. While some practitioners have
long put cameras in the hands of robots, our work systematically analyzes the
benefits of doing so and provides simple and broadly applicable insights for
improving end-to-end learned vision-based robotic manipulation.
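To make the regularization described in the abstract concrete, below is a minimal PyTorch sketch of a dual-perspective policy in which the third-person image stream is compressed through a variational information bottleneck (VIB) before being fused with the hand-centric features. This is an illustrative sketch only: the network sizes, the 84x84 input resolution, the `DualPerspectivePolicy` name, and the 1e-3 bottleneck weight are assumptions, not the paper's exact architecture or hyperparameters.

```python
# Hedged sketch: two visual streams, with the third-person stream regularized by a
# variational information bottleneck (KL penalty toward a standard normal prior).
# Sizes and hyperparameters are illustrative assumptions, not the authors' settings.
import torch
import torch.nn as nn


def make_cnn(out_dim: int) -> nn.Sequential:
    """Small convolutional trunk for 3x84x84 RGB observations (assumed resolution)."""
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, out_dim),
    )


class DualPerspectivePolicy(nn.Module):
    """Hand-centric features pass through deterministically; third-person features
    are squeezed through a stochastic VIB latent before reaching the policy head."""

    def __init__(self, feat_dim: int = 64, z_dim: int = 32, action_dim: int = 4):
        super().__init__()
        self.hand_enc = make_cnn(feat_dim)       # eye-in-hand camera stream
        self.third_enc = make_cnn(2 * z_dim)     # outputs [mu, log_var] for the latent
        self.policy_head = nn.Sequential(
            nn.Linear(feat_dim + z_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, hand_img, third_img):
        h = self.hand_enc(hand_img)
        mu, log_var = self.third_enc(third_img).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)  # reparameterization
        # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch: the bottleneck cost.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(dim=-1).mean()
        action = self.policy_head(torch.cat([h, z], dim=-1))
        return action, kl


# Usage sketch: add beta * kl to whatever RL loss is being optimized, so the
# third-person stream only transmits information that pays for its cost in nats.
policy = DualPerspectivePolicy()
hand_obs = torch.randn(8, 3, 84, 84)
third_obs = torch.randn(8, 3, 84, 84)
action, kl_penalty = policy(hand_obs, third_obs)
loss = action.pow(2).mean() + 1e-3 * kl_penalty  # placeholder task loss + VIB term
loss.backward()
```

The intuition, following the abstract, is that the KL penalty discourages the agent from leaning on third-person details it does not need, which is what would otherwise hurt out-of-distribution generalization when the third-person view must be included for observability.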
Related papers
- When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations [1.8130068086063336]
We contribute to the field of unsupervised (visual) representation learning from three perspectives.
We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs)
We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics.
We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on self-supervised learning.
arXiv Detail & Related papers (2023-11-30T15:57:55Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environment in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representations of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- See, Hear, and Feel: Smart Sensory Fusion for Robotic Manipulation [49.925499720323806]
We study how visual, auditory, and tactile perception can jointly help robots to solve complex manipulation tasks.
We build a robot system that can see with a camera, hear with a contact microphone, and feel with a vision-based tactile sensor.
arXiv Detail & Related papers (2022-12-07T18:55:53Z)
- Towards self-attention based visual navigation in the real world [0.0]
Vision guided navigation requires processing complex visual information to inform task-orientated decisions.
Deep Reinforcement Learning agents trained in simulation often exhibit unsatisfactory results when deployed in the real-world.
This is the first demonstration of a self-attention-based agent successfully trained to navigate a 3D action space using fewer than 4000 parameters.
arXiv Detail & Related papers (2022-09-15T04:51:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.