OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
- URL: http://arxiv.org/abs/2601.00352v1
- Date: Thu, 01 Jan 2026 14:11:49 GMT
- Title: OmniVaT: Single Domain Generalization for Multimodal Visual-Tactile Learning
- Authors: Liuxiang Qiu, Hui Da, Yuzhen Niu, Tiesong Zhao, Yang Cao, Zheng-Jun Zha
- Abstract summary: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL. We propose an OmniVaT framework that, for the first time, successfully addresses this task.
- Score: 66.4730970958238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-tactile learning (VTL) enables embodied agents to perceive the physical world by integrating visual (VIS) and tactile (TAC) sensors. However, VTL still suffers from modality discrepancies between VIS and TAC images, as well as domain gaps caused by non-standardized tactile sensors and inconsistent data collection procedures. We formulate these challenges as a new task, termed single domain generalization for multimodal VTL (SDG-VTL). In this paper, we propose an OmniVaT framework that, for the first time, successfully addresses this task. On the one hand, OmniVaT integrates a multimodal fractional Fourier adapter (MFFA) to map VIS and TAC embeddings into a unified embedding-frequency space, thereby effectively mitigating the modality gap without multi-domain training data or careful cross-modal fusion strategies. On the other hand, it also incorporates a discrete tree generation (DTG) module that obtains diverse and reliable multimodal fractional representations through a hierarchical tree structure, thereby enhancing its adaptivity to fluctuating domain shifts in unseen domains. Extensive experiments demonstrate the superior cross-domain generalization performance of OmniVaT on the SDG-VTL task.
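As a rough illustration of what "mapping VIS and TAC embeddings into a unified embedding-frequency space" could look like, the sketch below applies one shared fractional Fourier operator along the embedding dimension of both modalities. It is not the authors' MFFA: the abstract does not specify the transform, so the weighted (spectral-projector) fractional DFT used here, the function names `unitary_dft`, `fractional_dft`, and `to_shared_space`, the fixed `order` value, and the random stand-in embeddings are all assumptions made for illustration.

```python
import numpy as np


def unitary_dft(n):
    """Unitary n x n DFT matrix F, which satisfies F^4 = I."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n) / np.sqrt(n)


def fractional_dft(n, order):
    """Weighted fractional DFT of the given order (one of several discrete definitions).

    F has only four distinct eigenvalues, lam_k = exp(-i*pi*k/2) for k = 0..3.
    The projector onto each eigenspace is P_k = (1/4) * sum_j conj(lam_k)^j F^j,
    and the fractional power is F^a = sum_k exp(-i*pi*k*a/2) P_k.
    """
    f = unitary_dft(n)
    powers = [np.linalg.matrix_power(f, j) for j in range(4)]
    result = np.zeros((n, n), dtype=complex)
    for k in range(4):
        lam = np.exp(-0.5j * np.pi * k)
        projector = sum(np.conj(lam) ** j * powers[j] for j in range(4)) / 4
        result += np.exp(-0.5j * np.pi * k * order) * projector
    return result


def to_shared_space(vis_emb, tac_emb, order):
    """Apply the same fractional operator to both modalities (hypothetical helper).

    vis_emb and tac_emb are arrays of shape (batch, dim); the transform acts
    along the embedding dimension.
    """
    t = fractional_dft(vis_emb.shape[-1], order)
    return vis_emb @ t.T, tac_emb @ t.T


# Stand-in embeddings; a real pipeline would take these from VIS/TAC encoders.
rng = np.random.default_rng(0)
vis = rng.standard_normal((4, 16))
tac = rng.standard_normal((4, 16))
vis_f, tac_f = to_shared_space(vis, tac, order=0.5)
```

With `order=0` the operator is the identity (the plain embedding space), with `order=1` it is the ordinary DFT, and intermediate orders interpolate between the two. A learnable or searched order would be one way to realize the "fractional" idea, but that choice is again an assumption rather than something stated in the abstract.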
Related papers
- Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation [74.16390314862801]
We study the challenging task of Few-Shot Video Domain Adaptation (FSVDA). We introduce a novel framework of Modality-Collaborative Low-Rank Decomposers (MC-LRD) to decompose modality-unique and modality-shared features. Our model achieves significant improvements over existing methods.
arXiv Detail & Related papers (2025-11-24T03:09:59Z)
- Rethinking Efficient Mixture-of-Experts for Remote Sensing Modality-Missing Classification [33.302856478333524]
Multimodal classification in remote sensing often suffers from missing modalities caused by environmental interference, sensor failures, or atmospheric effects. Existing two-stage adaptation methods are computationally expensive and assume complete multimodal data during training, limiting their generalization to real-world incompleteness. We propose a Missing-aware Mixture-of-Loras framework that reformulates modality missing as a multi-task learning problem.
arXiv Detail & Related papers (2025-11-14T16:31:37Z)
- NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning [1.7603474309877931]
NexViTAD is a cross-domain anomaly detection framework based on vision foundation models. It addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms. It delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains.
arXiv Detail & Related papers (2025-07-10T09:29:26Z)
- AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [49.81255045696323]
We present the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet). AuxDet integrates metadata semantics with visual features, guiding adaptive representation learning for each sample. Experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-21T07:02:05Z)
- MultiTSF: Transformer-based Sensor Fusion for Human-Centric Multi-view and Multi-modal Action Recognition [2.7745600113170994]
Action recognition from multi-modal and multi-view observations holds significant potential for applications in surveillance, robotics, and smart environments. We propose the Multi-modal Multi-view Transformer-based Sensor Fusion (MultiTSF). The proposed method leverages a Transformer-based model to dynamically capture inter-view relationships and temporal dependencies across multiple views.
arXiv Detail & Related papers (2025-04-03T05:04:05Z)
- Let Synthetic Data Shine: Domain Reassembly and Soft-Fusion for Single Domain Generalization [68.41367635546183]
Single Domain Generalization aims to train models with consistent performance across diverse scenarios using data from a single source. We propose Discriminative Domain Reassembly and Soft-Fusion (DRSF), a training framework leveraging synthetic data to improve model generalization.
arXiv Detail & Related papers (2025-03-17T18:08:03Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition [4.621578854541836]
We propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models.
Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT).
In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions.
arXiv Detail & Related papers (2023-08-04T06:41:35Z)
- MDViT: Multi-domain Vision Transformer for Small Medical Image Segmentation Datasets [19.44142290594537]
Vision transformers (ViTs) have emerged as a promising solution to improve medical image segmentation (MIS).
ViTs are typically trained using a single source of data, which overlooks the valuable knowledge that could be leveraged from other available datasets.
In this paper, we propose MDViT, the first multi-domain ViT that includes domain adapters to mitigate data-hunger and combat negative knowledge transfer (NKT).
arXiv Detail & Related papers (2023-07-05T08:19:29Z)
- FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (e.g., RGB) attack scenario with the help of available multi-modal data.
Experiments demonstrate that a single model trained with FM-ViT can not only flexibly evaluate samples from different modalities, but also outperform existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.