Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities
- URL: http://arxiv.org/abs/2511.11512v1
- Date: Fri, 14 Nov 2025 17:34:20 GMT
- Title: Collaborative Representation Learning for Alignment of Tactile, Language, and Vision Modalities
- Authors: Yiyun Zhou, Mingjing Xu, Jingwei Shi, Quanjiang Li, Jingyuan Chen
- Abstract summary: Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. Existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. We propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method.
- Score: 19.45726946555448
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tactile sensing offers rich and complementary information to vision and language, enabling robots to perceive fine-grained object properties. However, existing tactile sensors lack standardization, leading to redundant features that hinder cross-sensor generalization. Moreover, existing methods fail to fully integrate the intermediate communication among tactile, language, and vision modalities. To address this, we propose TLV-CoRe, a CLIP-based Tactile-Language-Vision Collaborative Representation learning method. TLV-CoRe introduces a Sensor-Aware Modulator to unify tactile features across different sensors and employs tactile-irrelevant decoupled learning to disentangle irrelevant tactile features. Additionally, a Unified Bridging Adapter is introduced to enhance tri-modal interaction within the shared representation space. To fairly evaluate the effectiveness of tactile models, we further propose the RSS evaluation framework, focusing on Robustness, Synergy, and Stability across different methods. Experimental results demonstrate that TLV-CoRe significantly improves sensor-agnostic representation learning and cross-modal alignment, offering a new direction for multimodal tactile representation.
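The abstract names three components: a Sensor-Aware Modulator that normalizes tactile features across sensors, tactile-irrelevant decoupled learning, and a Unified Bridging Adapter for tri-modal interaction. Below is a minimal PyTorch sketch of how such a tri-modal alignment objective could be wired up; the class names follow the paper's terminology, but the FiLM-style modulation, residual adapter, and pairwise InfoNCE losses are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a TLV-CoRe-style tri-modal alignment objective.
# Module internals below are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SensorAwareModulator(nn.Module):
    """Per-sensor FiLM-style scale/shift to unify tactile features (assumed)."""
    def __init__(self, dim: int, num_sensors: int):
        super().__init__()
        self.scale = nn.Embedding(num_sensors, dim)
        self.shift = nn.Embedding(num_sensors, dim)
        nn.init.ones_(self.scale.weight)
        nn.init.zeros_(self.shift.weight)

    def forward(self, feats, sensor_ids):
        return feats * self.scale(sensor_ids) + self.shift(sensor_ids)

class UnifiedBridgingAdapter(nn.Module):
    """One lightweight adapter shared by all three modalities (assumed)."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)  # residual keeps the CLIP features intact

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy batch: pretend CLIP-space features from each modality.
B, D = 8, 512
tactile, language, vision = (torch.randn(B, D) for _ in range(3))
sensor_ids = torch.randint(0, 4, (B,))  # e.g. 4 different tactile sensors

modulator = SensorAwareModulator(D, num_sensors=4)
adapter = UnifiedBridgingAdapter(D)

t = adapter(modulator(tactile, sensor_ids))
l, v = adapter(language), adapter(vision)
loss = info_nce(t, l) + info_nce(t, v) + info_nce(l, v)
loss.backward()
```

The residual connection in the adapter is a common choice when adapting frozen CLIP features, since it lets the model fall back to the original embedding when the adaptation signal is weak.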
Related papers
- Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation [21.78866976181311]
See-through-skin (STS) sensors combine tactile and visual perception. Existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction.
arXiv Detail & Related papers (2025-12-10T17:35:13Z) - VLA-Touch: Enhancing Vision-Language-Action Models with Dual-Level Tactile Feedback [21.08021535027628]
We present VLA-Touch, an approach that enhances generalist robot policies with tactile sensing. Our method introduces two key innovations: (1) a pipeline that leverages a pretrained tactile-language model that provides semantic tactile feedback for high-level task planning, and (2) a diffusion-based controller that refines VLA-generated actions with tactile signals for contact-rich manipulation.
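As a rough illustration of the dual-level design described above, the sketch below separates the two feedback paths: semantic tactile feedback folded into the planner's prompt, and a learned controller that nudges VLA actions using tactile features. Both modules are stand-ins under assumed shapes; the paper's actual controller is diffusion-based, whereas this residual refiner is a simplification.

```python
# Hedged sketch of VLA-Touch's two feedback levels; all internals assumed.
import torch
import torch.nn as nn

class TactileActionRefiner(nn.Module):
    """Stand-in for the diffusion-based controller: predicts a
    correction to the VLA action from tactile context."""
    def __init__(self, action_dim=7, tactile_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + tactile_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim))

    def forward(self, action, tactile_feat):
        delta = self.net(torch.cat([action, tactile_feat], dim=-1))
        return action + delta  # contact-aware refinement

def plan_with_tactile(task_prompt: str, tactile_caption: str) -> str:
    # High level: fold semantic tactile feedback into the planning prompt.
    return f"{task_prompt}\nTactile observation: {tactile_caption}"

refiner = TactileActionRefiner()
vla_action = torch.randn(1, 7)     # action from a pretrained VLA policy
tactile_feat = torch.randn(1, 64)  # encoded tactile signal
refined = refiner(vla_action, tactile_feat)
prompt = plan_with_tactile("Insert the plug into the socket.",
                           "plug edge feels misaligned, high shear on left")
```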
arXiv Detail & Related papers (2025-07-23T07:54:10Z) - Universal Visuo-Tactile Video Understanding for Embodied Interaction [16.587054862266168]
We present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video understanding. VTV-LLM bridges the gap between tactile perception and natural language. We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation.
arXiv Detail & Related papers (2025-05-28T16:43:01Z) - Towards Generalization of Tactile Image Generation: Reference-Free Evaluation in a Leakage-Free Setting [25.355424080824996]
Tactile sensing is critical for human perception and underpins applications in computer vision, robotics, and multimodal learning. Because tactile data is often scarce and costly to acquire, generating synthetic tactile images provides a scalable solution to augment real-world measurements. We demonstrate that overlapping training and test samples in commonly used datasets inflate performance metrics, obscuring the true generalizability of tactile models.
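The leakage problem called out above can be audited with simple tooling. The sketch below flags exact duplicates shared between train and test splits via content hashing; this is not the authors' protocol (which targets reference-free generation evaluation), and near-duplicates would additionally need a perceptual hash. The paths are hypothetical.

```python
# Minimal train/test leakage audit via exact content hashing (assumed setup).
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's raw bytes: identical files get identical digests."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def split_overlap(train_dir: str, test_dir: str):
    """Return test images whose bytes also appear in the training split."""
    train_hashes = {file_digest(p) for p in Path(train_dir).glob("*.png")}
    return [p for p in Path(test_dir).glob("*.png")
            if file_digest(p) in train_hashes]

# leaked = split_overlap("tactile/train", "tactile/test")  # hypothetical paths
```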
arXiv Detail & Related papers (2025-03-10T02:37:22Z) - AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors [11.506370451126378]
Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to understand and manipulate objects. We introduce TacQuad, an aligned multi-modal tactile multi-sensor dataset from four different visuo-tactile sensors. We propose AnyTouch, a unified static-dynamic multi-sensor representation learning framework with a multi-level structure.
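A minimal sketch of the static-dynamic unification idea: one encoder that accepts either a single tactile frame or a short clip, plus a learned sensor-identity embedding in the spirit of multi-sensor training. The backbone, temporal mean-pooling, and sensor token are all assumptions, not AnyTouch's architecture.

```python
# Hedged sketch of a unified static-dynamic tactile encoder (internals assumed).
import torch
import torch.nn as nn

class UnifiedTactileEncoder(nn.Module):
    def __init__(self, in_ch=3, dim=256, num_sensors=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
        self.sensor_token = nn.Embedding(num_sensors, dim)  # sensor identity cue

    def forward(self, x, sensor_id):
        # x: (B, C, H, W) static image or (B, T, C, H, W) dynamic clip
        if x.dim() == 5:
            b, t = x.shape[:2]
            feats = self.backbone(x.flatten(0, 1)).view(b, t, -1).mean(1)
        else:
            feats = self.backbone(x)
        return feats + self.sensor_token(sensor_id)

enc = UnifiedTactileEncoder()
static = enc(torch.randn(2, 3, 64, 64), torch.tensor([0, 1]))
dynamic = enc(torch.randn(2, 8, 3, 64, 64), torch.tensor([2, 3]))
```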
arXiv Detail & Related papers (2025-02-15T08:33:25Z) - Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset [50.09271028495819]
Multimodal research related to touch has focused mainly on the visual and tactile modalities.
We construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration.
arXiv Detail & Related papers (2024-03-14T19:01:54Z) - Tactile-Filter: Interactive Tactile Perception for Part Mating [54.46221808805662]
Humans rely on touch and tactile sensing for many dexterous manipulation tasks.
Vision-based tactile sensors are now widely used for various robotic perception and control tasks.
We present a method for interactive perception using vision-based tactile sensors for a part mating task.
arXiv Detail & Related papers (2023-03-10T16:27:37Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
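Late fusion, as described here, typically means each modality is encoded independently and only the utterance-level embeddings are combined. A minimal sketch follows, assuming a speaker-recognition-style speech embedding and a BERT [CLS] embedding as inputs; the dimensions and classifier head are placeholders, not the paper's architecture.

```python
# Minimal late-fusion classifier over precomputed per-modality embeddings.
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, speech_dim=192, text_dim=768, num_emotions=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(256, num_emotions))

    def forward(self, speech_emb, text_emb):
        # Fuse only at the decision level: concatenate, then classify.
        return self.head(torch.cat([speech_emb, text_emb], dim=-1))

clf = LateFusionClassifier()
speech_emb = torch.randn(4, 192)  # e.g. from a speaker-recognition model
text_emb = torch.randn(4, 768)    # e.g. from BERT's [CLS] token
logits = clf(speech_emb, text_emb)
```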
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Dynamic Modeling of Hand-Object Interactions via Tactile Sensing [133.52375730875696]
In this work, we employ a high-resolution tactile glove to perform four different interactive activities on a diversified set of objects.
We build our model on a cross-modal learning framework and generate the labels using a visual processing pipeline to supervise the tactile model.
This work takes a step toward dynamics modeling of hand-object interactions from dense tactile sensing.
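The cross-modal supervision loop described above can be summarized in a few lines: a visual pipeline produces labels that the tactile model regresses. In this sketch the visual pose extractor is a random stand-in, and the glove resolution and pose dimensionality (21 hand keypoints in 3D) are assumptions.

```python
# Hedged sketch of vision-supervised tactile learning (shapes assumed).
import torch
import torch.nn as nn

tactile_net = nn.Sequential(  # maps a tactile-glove frame to a hand pose
    nn.Flatten(), nn.Linear(32 * 32, 256), nn.ReLU(), nn.Linear(256, 21 * 3))

def visual_pose_labels(video_frames):
    # Stand-in for the visual processing pipeline (e.g. a hand-pose tracker).
    return torch.randn(video_frames.size(0), 21 * 3)

opt = torch.optim.Adam(tactile_net.parameters(), lr=1e-3)
tactile = torch.randn(16, 1, 32, 32)   # synchronized tactile frames
video = torch.randn(16, 3, 224, 224)   # synchronized RGB frames

pred = tactile_net(tactile)
loss = nn.functional.mse_loss(pred, visual_pose_labels(video))
loss.backward()
opt.step()
```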
arXiv Detail & Related papers (2021-09-09T16:04:14Z) - Elastic Tactile Simulation Towards Tactile-Visual Perception [58.44106915440858]
We propose Elastic Interaction of Particles (EIP) for tactile simulation.
EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact.
We further propose a tactile-visual perception network that enables information fusion between tactile data and visual images.
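To make the elastic-particle idea concrete, here is a toy damped spring-grid simulation: each particle is pulled back toward its rest position while an external contact force deforms a patch. The constants and explicit Euler integration are illustrative only and do not reflect EIP's actual solver.

```python
# Toy elastic-particle membrane under contact (not the paper's formulation).
import numpy as np

n, k, dt = 16, 50.0, 0.01            # grid size, spring stiffness, time step
pos = np.stack(np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n)), -1)
rest = pos.copy()                    # undeformed particle positions
vel = np.zeros_like(pos)

def elastic_step(pos, vel, contact_force):
    # Elasticity pulls each particle toward rest; contact deforms the membrane.
    force = -k * (pos - rest) + contact_force
    vel = 0.9 * vel + dt * force     # damped explicit Euler
    return pos + dt * vel, vel

press = np.zeros_like(pos)
press[6:10, 6:10] = [0.0, -5.0]      # indenter pushing a patch downward
for _ in range(100):
    pos, vel = elastic_step(pos, vel, press)
```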
arXiv Detail & Related papers (2021-08-11T03:49:59Z) - Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition [131.6328804788164]
We propose a framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos).
SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality.
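The teacher-student setup described here follows the standard knowledge-distillation recipe: the sensor teacher's softened predictions guide the video student alongside the usual cross-entropy. The sketch below uses Hinton-style KL distillation with placeholder encoders; SAKDN's actual semantics-aware losses are more involved.

```python
# Hedged sketch of sensor-to-vision distillation (encoders are placeholders).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # sensor
student = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 10))  # video

T = 4.0                                # distillation temperature
sensor_feat = torch.randn(8, 128)      # wearable-sensor features
video_feat = torch.randn(8, 512)       # RGB-video features
labels = torch.randint(0, 10, (8,))

with torch.no_grad():                  # teacher provides soft targets only
    soft_targets = F.softmax(teacher(sensor_feat) / T, dim=-1)
logits = student(video_feat)
kd = F.kl_div(F.log_softmax(logits / T, dim=-1), soft_targets,
              reduction="batchmean") * T * T
loss = kd + F.cross_entropy(logits, labels)
loss.backward()
```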
arXiv Detail & Related papers (2020-09-01T03:38:31Z) - OmniTact: A Multi-Directional High Resolution Touch Sensor [109.28703530853542]
Existing tactile sensors are either flat, have small sensitive fields, or provide only low-resolution signals.
We introduce OmniTact, a multi-directional high-resolution tactile sensor.
We evaluate the capabilities of OmniTact on a challenging robotic control task.
arXiv Detail & Related papers (2020-03-16T01:31:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.