ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
- URL: http://arxiv.org/abs/2602.11643v1
- Date: Thu, 12 Feb 2026 06:56:29 GMT
- Title: ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
- Authors: Yufeng Tian, Shuiqi Cheng, Tianming Wei, Tianxing Zhou, Yuanhang Zhang, Zixian Liu, Qianwei Han, Zhecheng Yuan, Huazhe Xu
- Abstract summary: We present ViTaS, a framework that incorporates both visual and tactile information to guide the behavior of an agent. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments.
- Score: 33.49725304395789
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tactile information plays a crucial role in human manipulation tasks and has recently garnered increasing attention in robotic manipulation. However, existing approaches mostly focus on aligning visual and tactile features, and the integration mechanism tends to be direct concatenation. Consequently, they struggle to cope with occluded scenarios, because they neglect the inherent complementary nature of the two modalities and may not exploit their alignment sufficiently, which limits their potential for real-world deployment. In this paper, we present ViTaS, a simple yet effective framework that incorporates both visual and tactile information to guide the behavior of an agent. We introduce Soft Fusion Contrastive Learning, an advanced version of the conventional contrastive learning method, together with a CVAE module, to exploit the alignment and complementarity within visuo-tactile representations. We demonstrate the effectiveness of our method in 12 simulated and 3 real-world environments, and our experiments show that ViTaS significantly outperforms existing baselines. Project page: https://skyrainwind.github.io/ViTaS/index.html.
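The abstract does not spell out the Soft Fusion Contrastive Learning objective, so the following is only a rough sketch of one way such a loss is commonly written: a cross-modal InfoNCE term whose hard one-hot targets are replaced by soft, similarity-weighted targets. The function name, temperatures, and soft-target construction below are illustrative assumptions, not details taken from ViTaS.

```python
# Illustrative only: ViTaS does not publish this exact loss in the abstract;
# this shows one common way to "soften" a visuo-tactile InfoNCE objective.
import torch
import torch.nn.functional as F


def soft_fusion_contrastive_loss(z_vis, z_tac, tau=0.07, tau_soft=0.1):
    """Cross-modal contrastive loss with soft (non one-hot) targets.

    z_vis, z_tac: (B, D) visual / tactile embeddings of the same timesteps.
    Instead of forcing each visual embedding to match only its paired tactile
    embedding, the target distribution is softened by intra-modal similarity,
    so visually near-identical samples are not pushed apart.
    """
    z_vis = F.normalize(z_vis, dim=-1)
    z_tac = F.normalize(z_tac, dim=-1)

    # Cross-modal similarity logits: visual queries vs. tactile keys.
    logits = z_vis @ z_tac.t() / tau  # (B, B)

    # Soft targets from intra-modal (visual-visual) similarity.
    with torch.no_grad():
        targets = F.softmax(z_vis @ z_vis.t() / tau_soft, dim=-1)

    # Soft cross-entropy in both directions (vision->touch and touch->vision).
    loss_v2t = -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2v = -(targets * F.log_softmax(logits.t(), dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

The CVAE module mentioned in the abstract would complement such a loss by reconstructing one modality conditioned on the other, one way to exploit the complementarity of vision and touch under occlusion; it is omitted from this sketch.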
Related papers
- Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. We propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics.
arXiv Detail & Related papers (2025-07-23T12:30:19Z)
- Touch in the Wild: Learning Fine-Grained Manipulation with a Portable Visuo-Tactile Gripper [7.618517580705364]
We present a portable, lightweight gripper with integrated tactile sensors. We propose a cross-modal representation learning framework that integrates visual and tactile signals. We validate our approach on fine-grained tasks such as test tube insertion and pipette-based fluid transfer.
arXiv Detail & Related papers (2025-07-20T17:53:59Z)
- ConViTac: Aligning Visual-Tactile Fusion with Contrastive Representations [7.870120920732663]
We propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion. Our key contribution is a Contrastive Embedding Conditioning mechanism that leverages a contrastive encoder pretrained to project visual and tactile inputs into unified latent embeddings. We conduct extensive experiments to demonstrate the superiority of ConViTac in real-world settings over current state-of-the-art methods.
arXiv Detail & Related papers (2025-06-25T18:43:35Z)
- ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers [7.505873965164197]
We introduce ViTaPEs, a framework to learn task-agnostic representations for visuotactile perception. Our approach exploits a novel multi-scale positional encoding scheme to capture intra-modal structures. We show that ViTaPEs surpasses state-of-the-art baselines across various recognition tasks.
arXiv Detail & Related papers (2025-05-26T14:19:29Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Multimodal Visual-Tactile Representation Learning through Self-Supervised Contrastive Pre-Training [0.850206009406913]
MViTac is a novel methodology that leverages contrastive learning to integrate vision and touch sensations in a self-supervised fashion.
By drawing on both sensory inputs, MViTac applies intra- and inter-modality losses to learn representations, resulting in enhanced material property classification and more adept grasping prediction (a minimal sketch of such intra- and inter-modality contrastive losses appears after this list).
arXiv Detail & Related papers (2024-01-22T15:11:57Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- The Power of the Senses: Generalizable Manipulation from Vision and Touch through Masked Multimodal Learning [60.91637862768949]
We propose Masked Multimodal Learning (M3L) to fuse visual and tactile information in a reinforcement learning setting.
M3L learns a policy and visual-tactile representations based on masked autoencoding (see the masked-autoencoding sketch after this list).
We evaluate M3L on three simulated environments with both visual and tactile observations.
arXiv Detail & Related papers (2023-11-02T01:33:00Z)
- Visuo-Tactile Transformers for Manipulation [4.60687205898687]
We present Visuo-Tactile Transformers (VTTs), a novel multimodal representation learning approach suited for model-based reinforcement learning and planning.
Specifically, VTT uses tactile feedback together with self- and cross-modal attention to build latent heatmap representations that focus attention on important task features in the visual domain.
arXiv Detail & Related papers (2022-09-30T22:38:29Z)
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors of interaction-based models.
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
- Visual Adversarial Imitation Learning using Variational Models [60.69745540036375]
Reward function specification remains a major impediment for learning behaviors through deep reinforcement learning.
Visual demonstrations of desired behaviors often present an easier and more natural way to teach agents.
We develop a variational model-based adversarial imitation learning algorithm.
arXiv Detail & Related papers (2021-07-16T00:15:18Z)
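As referenced in the MViTac entry above, here is a minimal sketch of combined intra- and inter-modality contrastive losses for visuo-tactile representation learning. It assumes paired visual/tactile embeddings plus augmented views of each modality; the weights and temperature are placeholders, and this is not the MViTac reference implementation.

```python
# Illustrative sketch, not the MViTac reference implementation: combines
# intra-modality (augmented views of the same modality) and inter-modality
# (paired visual/tactile samples) InfoNCE terms.
import torch
import torch.nn.functional as F


def info_nce(queries, keys, tau=0.07):
    """Standard InfoNCE: the i-th query should match the i-th key."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / tau  # (B, B)
    labels = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, labels)


def visuo_tactile_contrastive_loss(z_vis, z_vis_aug, z_tac, z_tac_aug,
                                   w_intra=1.0, w_inter=1.0):
    """Weighted sum of intra-modality and inter-modality contrastive terms."""
    intra = info_nce(z_vis, z_vis_aug) + info_nce(z_tac, z_tac_aug)
    inter = info_nce(z_vis, z_tac) + info_nce(z_tac, z_vis)
    return w_intra * intra + w_inter * inter
```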
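As referenced in the M3L entry above, the following masked-autoencoding sketch illustrates the general recipe of masking tokens from both modalities, reconstructing them with a shared encoder, and reusing the latent for a policy. Module sizes, the masking ratio, and the toy decoder are assumptions for illustration rather than details of M3L.

```python
# Rough illustration (not the M3L code): masked autoencoding over concatenated
# visual and tactile patch tokens, with the shared latent reused by the policy.
import torch
import torch.nn as nn


class MaskedMultimodalAE(nn.Module):
    def __init__(self, dim=256, n_heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.decoder = nn.Linear(dim, dim)  # toy reconstruction head
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, vis_tokens, tac_tokens, mask_ratio=0.5):
        # Concatenate visual and tactile tokens into one sequence.
        tokens = torch.cat([vis_tokens, tac_tokens], dim=1)  # (B, N, D)
        B, N, D = tokens.shape

        # Randomly replace a fraction of tokens with a learned mask token.
        mask = torch.rand(B, N, device=tokens.device) < mask_ratio
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, D), tokens)

        latent = self.encoder(masked)
        recon = self.decoder(latent)

        # Reconstruction loss on masked positions; the latent feeds the policy.
        loss = ((recon - tokens) ** 2)[mask].mean()
        return latent, loss
```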