TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception
- URL: http://arxiv.org/abs/2511.19509v1
- Date: Mon, 24 Nov 2025 00:43:59 GMT
- Title: TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception
- Authors: Kailin Lyu, Long Xiao, Jianing Zeng, Junhao Dong, Xuexin Liu, Zhuojun Zou, Haoyue Yang, Lin Shu, Jie Hao
- Abstract summary: We propose a robust multimodal fusion framework, TouchFormer. We employ a Modality-Adaptive Gating mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features. We show that TouchFormer achieves classification accuracy improvements of 2.48% and 6.83% on the SSMC and USMC tasks, respectively.
- Score: 8.939880394166348
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization (CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer's effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-source, and the videos are available in the supplementary materials.
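The abstract names the Modality-Adaptive Gating (MAG) mechanism and intra-/inter-modality attention but does not spell out how they are combined. The PyTorch-style sketch below shows one plausible arrangement; the module layout, dimensions, and gating formulation are illustrative assumptions, not the authors' released implementation, and the CER loss is omitted here.

```python
# Minimal sketch of modality-adaptive gated fusion with intra-/inter-modality
# attention. Hypothetical design; not the authors' released implementation.
import torch
import torch.nn as nn

class GatedMultimodalFusion(nn.Module):
    def __init__(self, dim=256, num_modalities=3, heads=4, num_classes=10):
        super().__init__()
        # Intra-modality attention: refines tokens within each modality.
        self.intra = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        # Inter-modality attention: exchanges information across modality summaries.
        self.inter = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)
        # Modality-adaptive gate: scores each modality from its fused summary.
        self.gate = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats, present_mask):
        # feats: list of (B, T_m, dim) token sequences, one entry per modality
        # present_mask: (B, M) with 1.0 where a modality is available, else 0.0
        #               (at least one modality is assumed present per sample)
        pooled = []
        for x in feats:
            x = self.intra(x)              # intra-modality attention
            pooled.append(x.mean(dim=1))   # (B, dim) summary token per modality
        tokens = torch.stack(pooled, dim=1)          # (B, M, dim)
        tokens = self.inter(tokens)                  # inter-modality attention
        scores = self.gate(tokens).squeeze(-1)       # (B, M) gating logits
        scores = scores.masked_fill(present_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)      # missing modalities get weight 0
        fused = (weights.unsqueeze(-1) * tokens).sum(dim=1)
        return self.classifier(fused)
```

Masking the gate logits before the softmax is one simple way to make the fusion degrade gracefully when a modality is absent, which matches the robustness claim; how TouchFormer actually handles missing inputs should be checked against the released code.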
Related papers
- Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization [0.06596280437011041]
We present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. Our implementation uses three modalities: speech, text, and facial imagery. We introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence.
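The entry above only names Dirichlet evidence and Dempster-Shafer fusion. The sketch below follows the standard evidential deep-learning recipe (belief = (alpha - 1)/S, uncertainty = K/S, conflict-normalized combination) for two modalities; the paper's exact formulation may differ.

```python
# Sketch of Dirichlet-evidence fusion for two modalities over K classes.
# Standard evidential deep-learning recipe; illustrative only.
import torch

def dempster_combine(alpha1, alpha2, K):
    """Combine two Dirichlet opinions with Dempster's rule (reduced form)."""
    def opinion(alpha):
        S = alpha.sum(dim=-1, keepdim=True)   # Dirichlet strength
        belief = (alpha - 1) / S              # per-class belief mass
        u = K / S                             # uncertainty mass
        return belief, u

    b1, u1 = opinion(alpha1)
    b2, u2 = opinion(alpha2)
    # Total conflict: belief assigned to different classes by the two sources.
    C = (b1.unsqueeze(-1) * b2.unsqueeze(-2)).sum((-2, -1)) - (b1 * b2).sum(-1)
    norm = (1 - C).clamp_min(1e-6).unsqueeze(-1)
    b = (b1 * b2 + b1 * u2 + b2 * u1) / norm  # fused belief
    u = (u1 * u2) / norm                      # fused uncertainty
    S = K / u                                 # back to Dirichlet strength
    return b * S + 1                          # fused Dirichlet parameters

# Per-modality evidence heads would typically produce alpha = softplus(logits) + 1.
alpha_speech = torch.tensor([[3.0, 1.5, 1.2]])
alpha_text   = torch.tensor([[2.0, 1.1, 4.0]])
print(dempster_combine(alpha_speech, alpha_text, K=3))
```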
arXiv Detail & Related papers (2026-02-09T19:12:30Z) - Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification [33.302856478333524]
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. We propose a Missing-aware Mixture-of-LoRAs framework that reformulates modality missing as a multi-task learning problem.
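One plausible reading of "Mixture-of-LoRAs routed by missing-modality pattern" is to attach a separate low-rank adapter to a frozen backbone layer for each pattern and select it at run time. The sketch below is that reading, not the paper's actual architecture.

```python
# Sketch: one low-rank adapter (LoRA) per missing-modality pattern, routed by
# a pattern index, so each pattern is treated as its own task. Illustrative only.
import torch
import torch.nn as nn

class MissingAwareLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, num_patterns, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)            # backbone projection
        self.base.weight.requires_grad_(False)        # keep backbone frozen
        # One (A, B) low-rank pair per missing-modality pattern.
        self.A = nn.Parameter(torch.randn(num_patterns, rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_patterns, d_out, rank))

    def forward(self, x, pattern_id):
        # x: (B, d_in); pattern_id: (B,) integer index of the missing pattern
        A = self.A[pattern_id]                        # (B, rank, d_in)
        B = self.B[pattern_id]                        # (B, d_out, rank)
        delta = torch.bmm(B, torch.bmm(A, x.unsqueeze(-1))).squeeze(-1)
        return self.base(x) + delta                   # frozen path + adapter path
```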
arXiv Detail & Related papers (2025-11-14T16:31:37Z) - Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z) - A CLIP-based Uncertainty Modal Modeling (UMM) Framework for Pedestrian Re-Identification in Autonomous Driving [6.223368492604449]
The Uncertainty Modal Modeling (UMM) framework integrates a multimodal token mapper, a synthetic modality augmentation strategy, and a cross-modal cue interactive learner. UMM achieves strong robustness, generalization, and computational efficiency under uncertain modality conditions.
arXiv Detail & Related papers (2025-08-15T04:50:27Z) - Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark [69.02666229531322]
We introduce a pioneering study that investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD). We find that most existing MIAD methods perform poorly on the MIIAD Bench, suffering significant performance degradation. We propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR.
arXiv Detail & Related papers (2024-10-02T16:47:55Z) - Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z) - MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances in self-supervised learning (SSL) to pre-train strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
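One common way to realize "learnable queries that adaptively aggregate global contextual cues within a modality" is cross-attention from a small set of learned query tokens; the sketch below assumes that reading and is not the paper's actual IMQ module.

```python
# Sketch: learnable query tokens that aggregate global context from one
# modality's features via cross-attention. Illustrative only.
import torch
import torch.nn as nn

class ImplicitQueryAggregator(nn.Module):
    def __init__(self, dim=256, num_queries=4, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, modality_tokens):
        # modality_tokens: (B, T, dim) features from a single modality
        B = modality_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)    # (B, Q, dim)
        out, _ = self.attn(q, modality_tokens, modality_tokens)
        return out                                          # (B, Q, dim) global cues
```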
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)