The Power of the Senses: Generalizable Manipulation from Vision and
Touch through Masked Multimodal Learning
- URL: http://arxiv.org/abs/2311.00924v1
- Date: Thu, 2 Nov 2023 01:33:00 GMT
- Title: The Power of the Senses: Generalizable Manipulation from Vision and
Touch through Masked Multimodal Learning
- Authors: Carmelo Sferrazza, Younggyo Seo, Hao Liu, Youngwoon Lee, Pieter Abbeel
- Abstract summary: We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
- Score: 60.91637862768949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans rely on the synergy of their senses for most essential tasks. For
tasks requiring object manipulation, we seamlessly and effectively exploit the
complementarity of our senses of vision and touch. This paper draws inspiration
from such capabilities and aims to find a systematic approach to fuse visual
and tactile information in a reinforcement learning setting. We propose Masked
Multimodal Learning (M3L), which jointly learns a policy and visual-tactile
representations based on masked autoencoding. The representations jointly
learned from vision and touch improve sample efficiency, and unlock
generalization capabilities beyond those achievable through each of the senses
separately. Remarkably, representations learned in a multimodal setting also
benefit vision-only policies at test time. We evaluate M3L on three simulated
environments with both visual and tactile observations: robotic insertion, door
opening, and dexterous in-hand manipulation, demonstrating the benefits of
learning a multimodal policy. Code and videos of the experiments are available
at https://sferrazza.cc/m3l_site.
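The abstract describes M3L as jointly learning a policy and visual-tactile representations via masked autoencoding. Purely as a rough illustration of the representation-learning half, below is a minimal PyTorch-style sketch of masked autoencoding over concatenated visual and tactile patch tokens; all module names, sizes, the masking ratio, and the treatment of tactile readings as a 3-channel image are assumptions for illustration, not the authors' implementation (see the linked code for the real one).

```python
# Minimal sketch of masked multimodal (visual + tactile) autoencoding.
# Sizes, masking ratio, and image-like tactile input are assumptions.
import torch
import torch.nn as nn


class MaskedMultimodalAutoencoder(nn.Module):
    def __init__(self, img_size=64, patch=8, dim=128, depth=4, heads=4, mask_ratio=0.75):
        super().__init__()
        self.n_per_mod = (img_size // patch) ** 2                # tokens per modality
        self.mask_ratio = mask_ratio
        # Separate patch embeddings for the visual and tactile "images".
        self.embed_vis = nn.Conv2d(3, dim, patch, patch)
        self.embed_tac = nn.Conv2d(3, dim, patch, patch)
        self.mod_emb = nn.Parameter(0.02 * torch.randn(2, 1, dim))
        self.pos_emb = nn.Parameter(0.02 * torch.randn(1, 2 * self.n_per_mod, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.mask_token = nn.Parameter(0.02 * torch.randn(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.head = nn.Linear(dim, patch * patch * 3)            # predict raw patch pixels

    def tokens(self, rgb, tactile):
        v = self.embed_vis(rgb).flatten(2).transpose(1, 2) + self.mod_emb[0]
        t = self.embed_tac(tactile).flatten(2).transpose(1, 2) + self.mod_emb[1]
        return torch.cat([v, t], dim=1) + self.pos_emb

    def forward(self, rgb, tactile):
        x = self.tokens(rgb, tactile)                            # (B, N, dim)
        B, N, D = x.shape
        keep = int(N * (1 - self.mask_ratio))
        # Random masking over the joint visual + tactile token sequence.
        perm = torch.rand(B, N, device=x.device).argsort(dim=1)
        keep_idx, mask_idx = perm[:, :keep], perm[:, keep:]
        visible = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        latent = self.encoder(visible)                           # representation for the policy
        # Decoder reconstructs masked slots from encoded visible tokens
        # plus mask tokens carrying the masked positions' embeddings.
        pos = self.pos_emb.expand(B, -1, -1)
        mask_pos = torch.gather(pos, 1, mask_idx.unsqueeze(-1).expand(-1, -1, D))
        mask_tok = self.mask_token.expand(B, N - keep, D) + mask_pos
        dec_in = torch.cat([latent, mask_tok], dim=1)
        recon = self.head(self.decoder(dec_in))[:, keep:]        # predictions for masked slots
        return latent, recon, mask_idx
```

A training step along these lines would patchify the raw frames, gather the ground-truth patches at mask_idx for an MSE reconstruction loss on the masked slots only, and feed a pooled version of latent to the reinforcement-learning policy alongside its usual update.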
Related papers
- 3D-ViTac: Learning Fine-Grained Manipulation with Visuo-Tactile Sensing [18.189782619503074]
This paper introduces 3D-ViTac, a multi-modal sensing and learning system for robots.
Our system features tactile sensors equipped with dense sensing units, each covering an area of 3 mm².
We show that even low-cost robots can perform precise manipulations and significantly outperform vision-only policies.
arXiv Detail & Related papers (2024-10-31T16:22:53Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac uses intra- and inter-modality losses to learn representations, resulting in improved material property classification and more adept grasping prediction (a hypothetical sketch of this loss structure follows this entry).
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
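The MViTac entry above refers to intra- and inter-modality contrastive losses. As a hypothetical sketch only (the function names, the two-augmented-views setup, and the InfoNCE form are assumptions, not the authors' code), such a combined objective could look like this:

```python
# Hypothetical sketch of intra- and inter-modality InfoNCE losses for paired
# visual/tactile embeddings, in the spirit of the MViTac summary above.
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE between two embedding batches; row i of `a` is the
    positive for row i of `b`, all other rows are negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def visual_tactile_contrastive_loss(vis_1, vis_2, tac_1, tac_2, w_intra=1.0, w_inter=1.0):
    """vis_1/vis_2 and tac_1/tac_2 are embeddings of two augmented views of the
    same visual and tactile observations, respectively."""
    intra = info_nce(vis_1, vis_2) + info_nce(tac_1, tac_2)   # within each modality
    inter = info_nce(vis_1, tac_1)                            # across modalities
    return w_intra * intra + w_inter * inter
```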
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representations of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification [11.562980171753162]
Generalizable person re-identification (Re-ID) is an active research topic in machine learning and computer vision.
Previous methods mainly focus on visual representation learning while neglecting the potential of semantic features during training.
We propose a Multi-Modal Equivalent Transformer called MMET for more robust visual-semantic embedding learning.
arXiv Detail & Related papers (2023-04-19T08:37:25Z)
- Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training [62.215025958347105]
We propose a self-supervised learning paradigm with multi-modal masked autoencoders.
We learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts.
arXiv Detail & Related papers (2022-09-15T07:26:43Z)
- Vision-Based Manipulators Need to Also See from Their Hands [58.398637422321976]
We study how the choice of visual perspective affects learning and generalization in the context of physical manipulation from raw sensor observations.
We find that a hand-centric (eye-in-hand) perspective affords reduced observability, but it consistently improves training efficiency and out-of-distribution generalization.
arXiv Detail & Related papers (2022-03-15T18:46:18Z)
- Multimodal perception for dexterous manipulation [14.314776558032166]
We propose a cross-modal sensory data generation framework for the translation between vision and touch.
We propose a spatio-temporal attention model for tactile texture recognition, which takes both spatial features and the time dimension into consideration (an illustrative sketch follows this entry).
arXiv Detail & Related papers (2021-12-28T21:20:26Z)
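The last entry mentions a spatio-temporal attention model for tactile texture recognition. As an illustrative sketch only (the factorization into per-frame spatial attention followed by temporal attention, and every size below, are assumptions rather than the paper's architecture):

```python
# Illustrative sketch: spatio-temporal attention over a sequence of tactile
# frames for texture classification. All sizes and the spatial-then-temporal
# factorization are assumptions.
import torch
import torch.nn as nn


class SpatioTemporalTactileClassifier(nn.Module):
    def __init__(self, channels=3, patch=4, dim=64, heads=4, num_classes=20):
        super().__init__()
        self.patchify = nn.Conv2d(channels, dim, patch, patch)   # spatial tokens per frame
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), 2)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True), 2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                       # x: (B, T, C, H, W) tactile frame sequence
        B, T, C, H, W = x.shape
        tokens = self.patchify(x.reshape(B * T, C, H, W)).flatten(2).transpose(1, 2)
        tokens = self.spatial(tokens)           # attention over spatial taxels within a frame
        frame_feat = tokens.mean(dim=1).reshape(B, T, -1)
        frame_feat = self.temporal(frame_feat)  # attention over the time dimension
        return self.head(frame_feat.mean(dim=1))


# Example usage: 16 tactile frames of size 32x32 -> class logits of shape (2, 20).
# logits = SpatioTemporalTactileClassifier()(torch.randn(2, 16, 3, 32, 32))
```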