Visuo-Tactile Transformers for Manipulation
- URL: http://arxiv.org/abs/2210.00121v1
- Date: Fri, 30 Sep 2022 22:38:29 GMT
- Title: Visuo-Tactile Transformers for Manipulation
- Authors: Yizhou Chen, Andrea Sipos, Mark Van der Merwe, Nima Fazeli
- Abstract summary: We present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning.
Specifically, VTT uses tactile feedback together with self- and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain.
- Score: 4.60687205898687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning representations in the joint domain of vision and touch can improve
manipulation dexterity, robustness, and sample-complexity by exploiting mutual
information and complementary cues. Here, we present Visuo-Tactile Transformers
(VTTs), a novel multimodal representation learning approach suited for
model-based reinforcement learning and planning. Our approach extends the
Visual Transformer (Dosovitskiy et al., 2021) to handle visuo-tactile
feedback. Specifically, VTT uses tactile feedback together with self- and
cross-modal attention to build latent heatmap representations that focus
attention on important task features in the visual domain. We demonstrate the
efficacy of VTT for representation learning with a comparative evaluation
against baselines on four simulated robot tasks and one real world block
pushing task. We conduct an ablation study over the components of VTT to
highlight the importance of cross-modality in representation learning.
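To make the described architecture concrete, here is a minimal PyTorch sketch of the idea in the abstract: image patches and a tactile reading are embedded as tokens, a joint transformer stack provides both self- and cross-modal attention, and the visual tokens are scored into a latent spatial heatmap. The module layout, token sizes, and tactile embedding are illustrative assumptions, not the authors' implementation.

```python
# Minimal visuo-tactile transformer sketch (PyTorch). Token sizes, the tactile
# embedding, and the heatmap head are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class VisuoTactileTransformerSketch(nn.Module):
    def __init__(self, img_size=64, patch=8, tactile_dim=6, dim=128, depth=4, heads=4):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # Patchify the image into visual tokens (ViT-style).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # Embed the tactile reading (e.g. a force/torque vector) as one token.
        self.tactile_embed = nn.Linear(tactile_dim, dim)
        # Positional + modality embeddings for all tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, dim))
        self.modality_embed = nn.Parameter(torch.zeros(1, 2, dim))  # [visual, tactile]
        # Joint self-attention over the concatenated tokens gives both
        # intra-modal (self) and cross-modal attention in one stack.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Score each visual token to form a latent spatial heatmap.
        self.heatmap_head = nn.Linear(dim, 1)

    def forward(self, image, tactile):
        B = image.shape[0]
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, P, dim)
        tac = self.tactile_embed(tactile).unsqueeze(1)            # (B, 1, dim)
        vis = vis + self.modality_embed[:, 0:1]
        tac = tac + self.modality_embed[:, 1:2]
        tokens = torch.cat([vis, tac], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)                             # self + cross-modal attention
        heatmap = self.heatmap_head(tokens[:, : self.n_patches]).softmax(dim=1)
        heatmap = heatmap.view(B, int(self.n_patches ** 0.5), -1)  # (B, H', W') latent heatmap
        return tokens, heatmap


# Example: a 64x64 RGB observation paired with a 6-D force/torque reading.
latent, heatmap = VisuoTactileTransformerSketch()(torch.randn(2, 3, 64, 64), torch.randn(2, 6))
```

In this sketch the joint encoder over concatenated tokens stands in for the paper's self- and cross-modal attention blocks, and the heatmap is simply a softmax over per-patch scores; the latent tokens and heatmap would then feed a model-based RL or planning pipeline.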
Related papers
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac combines intra- and inter-modality losses to learn representations, yielding improved material-property classification and more accurate grasp prediction.
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
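The MViTac entry above mentions intra- and inter-modality losses; the sketch below shows one plausible way to combine InfoNCE terms over paired vision/touch embeddings with two augmented views per modality. The pairing scheme, temperature, and loss weights are assumptions, not the authors' objective.

```python
# Rough sketch of combining intra- and inter-modality InfoNCE losses for paired
# vision/touch embeddings; the pairing scheme and temperature are assumptions.
import torch
import torch.nn.functional as F


def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE: each anchor's positive is the same-index row of `positive`."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(anchor.shape[0], device=anchor.device)
    return F.cross_entropy(logits, targets)


def multimodal_contrastive_loss(vis_a, vis_b, tac_a, tac_b, w_intra=1.0, w_inter=1.0):
    # Intra-modality: two augmented views of the same modality attract.
    intra = info_nce(vis_a, vis_b) + info_nce(tac_a, tac_b)
    # Inter-modality: vision and touch embeddings of the same sample attract.
    inter = info_nce(vis_a, tac_a) + info_nce(tac_a, vis_a)
    return w_intra * intra + w_inter * inter


# Example with a batch of 32 samples and 128-D projected embeddings.
v1, v2, t1, t2 = (torch.randn(32, 128) for _ in range(4))
loss = multimodal_contrastive_loss(v1, v2, t1, t2)
```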
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point clouds, depth, audio, tactile, and EEG data.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches in vision transformers.
arXiv Detail & Related papers (2022-03-11T13:48:11Z)
- Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)
- Learning Task Informed Abstractions [10.920599910769276]
We propose learning Task Informed Abstractions (TIA), which explicitly separate reward-correlated visual features from distractors.
TIA leads to significant performance gains over state-of-the-art methods on many visual control tasks.
arXiv Detail & Related papers (2021-06-29T17:56:11Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
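The SAKDN entry above describes a sensor-to-vision teacher-student setup; the snippet below is a generic multi-teacher knowledge-distillation sketch for that kind of setup, assuming standard soft-label KL distillation with a temperature. It is not SAKDN's actual objective (the paper's adaptive, semantics-aware losses are not reproduced here).

```python
# Generic multi-teacher knowledge-distillation sketch for a sensor-to-vision
# setup; the soft-label KL term and temperature are standard KD ingredients
# and are assumptions here, not SAKDN's actual losses.
import torch
import torch.nn.functional as F


def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on labels with KL to the averaged sensor-teacher soft labels."""
    ce = F.cross_entropy(student_logits, labels)
    # Average the soft predictions of the wearable-sensor teachers.
    teacher_probs = torch.stack(
        [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1), teacher_probs,
                  reduction="batchmean") * (T * T)
    return alpha * ce + (1.0 - alpha) * kd


# Example: two sensor teachers and a video student over 10 action classes.
s = torch.randn(8, 10)
teachers = [torch.randn(8, 10), torch.randn(8, 10)]
loss = multi_teacher_kd_loss(s, teachers, torch.randint(0, 10, (8,)))
```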
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.