Multimodal Visual-Tactile Representation Learning through
Self-Supervised Contrastive Pre-Training
- URL: http://arxiv.org/abs/2401.12024v1
- Date: Mon, 22 Jan 2024 15:11:57 GMT
- Title: Multimodal Visual-Tactile Representation Learning through
Self-Supervised Contrastive Pre-Training
- Authors: Vedant Dave, Fotios Lygerakis, Elmar Rueckert
- Abstract summary: MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By availing both sensory inputs, MViTac leverages intra and inter-modality losses for learning representations, resulting in enhanced material property classification and more adept grasping prediction.
- Score: 0.850206009406913
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapidly evolving field of robotics necessitates methods that can
facilitate the fusion of multiple modalities. Specifically, when it comes to
interacting with tangible objects, effectively combining visual and tactile
sensory data is key to understanding and navigating the complex dynamics of the
physical world, enabling a more nuanced and adaptable response to changing
environments. Nevertheless, much of the earlier work in merging these two
sensory modalities has relied on supervised methods utilizing datasets labeled
by humans. This paper introduces MViTac, a novel methodology that leverages
contrastive learning to integrate vision and touch sensations in a
self-supervised fashion. By availing both sensory inputs, MViTac leverages
intra and inter-modality losses for learning representations, resulting in
enhanced material property classification and more adept grasping prediction.
Through a series of experiments, we showcase the effectiveness of our method
and its superiority over existing state-of-the-art self-supervised and
supervised techniques. In evaluating our methodology, we focus on two distinct
tasks: material classification and grasping success prediction. Our results
indicate that MViTac facilitates the development of improved modality encoders,
yielding more robust representations as evidenced by linear probing
assessments.
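The abstract describes combining intra- and inter-modality contrastive losses but does not give the formulation. The following is a minimal sketch of how such a combined objective could look, assuming an InfoNCE-style loss over a batch of paired embeddings. The function names (`info_nce`, `visual_tactile_loss`), the temperature value, and the equal weighting of the four terms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss: row i of `keys` is the positive for row i of `queries`;
    every other row in the batch acts as a negative."""
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def visual_tactile_loss(vis_a, vis_b, tac_a, tac_b, temperature=0.07):
    """Hypothetical combined objective (an assumption, not the paper's exact loss).

    vis_a, vis_b: embeddings of two augmented views of the same images.
    tac_a, tac_b: embeddings of two augmented views of the paired tactile readings.
    """
    # Intra-modality terms: match two views within the same modality.
    intra = info_nce(vis_a, vis_b, temperature) + info_nce(tac_a, tac_b, temperature)
    # Inter-modality terms: match vision to touch and touch to vision.
    inter = info_nce(vis_a, tac_a, temperature) + info_nce(tac_a, vis_a, temperature)
    return intra + inter


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
batch, dim = 8, 32
vis = rng.normal(size=(batch, dim))
tac = rng.normal(size=(batch, dim))
noise = lambda: 0.01 * rng.normal(size=(batch, dim))
print(visual_tactile_loss(vis + noise(), vis + noise(), tac + noise(), tac + noise()))
```

In this sketch the intra-modality terms pull two views of the same sample together within one modality, while the inter-modality terms align an image embedding with the embedding of its paired tactile reading. In practice the embeddings would come from trainable encoders, and the loss would be written in an autodiff framework rather than NumPy.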
Related papers
- From Pretext to Purpose: Batch-Adaptive Self-Supervised Learning [32.18543787821028]
This paper proposes an adaptive technique of batch fusion for self-supervised contrastive learning.
It achieves state-of-the-art performance under equitable comparisons.
We suggest that the proposed method may contribute to the advancement of data-driven self-supervised learning research.
arXiv Detail & Related papers (2023-11-16T15:47:49Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding.
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
Long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We creatively propose a transformer-based framework for compositional HOI learning.
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
- TASKED: Transformer-based Adversarial learning for human activity recognition using wearable sensors via Self-KnowledgE Distillation [6.458496335718508]
We propose a novel Transformer-based Adversarial learning framework for human activity recognition using wearable sensors via Self-KnowledgE Distillation (TASKED).
In the proposed method, we adopt the teacher-free self-knowledge distillation to improve the stability of the training procedure and the performance of human activity recognition.
arXiv Detail & Related papers (2022-09-14T11:08:48Z)
- MMLatch: Bottom-up Top-down Fusion for Multimodal Sentiment Analysis [84.7287684402508]
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high and mid-level latent modality representations.
Models of human perception highlight the importance of top-down fusion, where high-level representations affect the way sensory inputs are perceived.
We propose a neural architecture that captures top-down cross-modal interactions, using a feedback mechanism in the forward pass during network training.
arXiv Detail & Related papers (2022-01-24T17:48:04Z)
- Attentive Cross-modal Connections for Deep Multimodal Wearable-based Emotion Recognition [7.559720049837459]
We present a novel attentive cross-modal connection to share information between convolutional neural networks.
Specifically, these connections improve emotion classification by sharing intermediate representations among EDA and ECG.
Our experiments show that the proposed approach is capable of learning strong multimodal representations and outperforms a number of baseline methods.
arXiv Detail & Related papers (2021-08-04T18:40:32Z)
- Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)
- Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
- Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in vision-sensor modality (videos).
The SAKDN uses multiple wearable-sensors as teacher modalities and uses RGB videos as student modality.
arXiv Detail & Related papers (2020-09-01T03:38:31Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.