D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference
- URL: http://arxiv.org/abs/2509.09747v1
- Date: Thu, 11 Sep 2025 10:54:07 GMT
- Title: D-CAT: Decoupled Cross-Attention Transfer between Sensor Modalities for Unimodal Inference
- Authors: Leen Daher, Zhaobo Wang, Malcolm Mielle,
- Abstract summary: Cross-modal transfer learning is used to improve multi-modal classification models. Existing methods require paired sensor data at both training and inference. We propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring joint sensor modalities during inference.
- Score: 3.6344649347926326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal transfer learning is used to improve multi-modal classification models (e.g., for human activity recognition in human-robot collaboration). However, existing methods require paired sensor data at both training and inference, limiting deployment in resource-constrained environments where full sensor suites are not economically or technically viable. To address this, we propose Decoupled Cross-Attention Transfer (D-CAT), a framework that aligns modality-specific representations without requiring both sensor modalities at inference. Our approach combines a self-attention module for feature extraction with a novel cross-attention alignment loss, which enforces alignment of the sensors' feature spaces without coupling the classification pipelines of the two modalities. We evaluate D-CAT on three multi-modal human activity datasets (IMU, video, and audio) under both in-distribution and out-of-distribution scenarios, comparing against uni-modal models. Results show that in in-distribution scenarios, transferring from high-performing modalities (e.g., video to IMU) yields up to 10% F1-score gains over uni-modal training. In out-of-distribution scenarios, even weaker source modalities (e.g., IMU to video) improve target performance, as long as the target model is not overfitted to the training data. By enabling single-sensor inference with cross-modal knowledge, D-CAT reduces hardware redundancy for perception systems while maintaining accuracy, which is critical for cost-sensitive or adaptive deployments (e.g., assistive robots in homes with variable sensor availability). Code is available at https://github.com/Schindler-EPFL-Lab/D-CAT.
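The abstract describes a cross-attention alignment loss that pulls a target modality's feature space toward a source modality's without coupling their classifiers. A minimal sketch of that idea is below; it is an illustrative reconstruction from the abstract only, not the authors' released code (the function names, the single-head attention form, and the mean-squared-error alignment objective are all assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_feats):
    """Re-express query-modality features as attention-weighted
    combinations of key-modality features (scaled dot-product)."""
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)
    return softmax(scores) @ key_feats

def alignment_loss(target_feats, source_feats):
    """Mean-squared distance between the target features and their
    cross-attended reconstruction from the source modality."""
    attended = cross_attention(target_feats, source_feats)
    return float(np.mean((target_feats - attended) ** 2))

# Example: align hypothetical IMU features to video features.
rng = np.random.default_rng(0)
imu_feats = rng.normal(size=(4, 8))    # 4 windows, 8-dim features
video_feats = rng.normal(size=(6, 8))  # 6 clips, 8-dim features
loss = alignment_loss(imu_feats, video_feats)
```

During training, minimizing such a loss alongside the target classifier's own objective would align the two feature spaces while leaving the source pipeline untouched, so only the target sensor is needed at inference.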
Related papers
- Wireless Dataset Similarity: Measuring Distances in Supervised and Unsupervised Machine Learning [15.036550722400085]
This paper introduces a task- and model-aware framework for measuring similarity between wireless datasets. It enables applications such as dataset selection/augmentation, simulation-to-real (sim2real) comparison, and informing decisions on model training/adaptation to new deployments.
arXiv Detail & Related papers (2026-01-03T01:15:27Z) - XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge [32.69565269313996]
Current methods that rely on transferring pre-trained models often encounter issues such as modality shift. We propose XTransfer, a first-of-its-kind method for resource-efficient, modality-agnostic model transfer. XTransfer achieves state-of-the-art performance on human sensing tasks while significantly reducing the costs of sensor data collection, model training, and edge deployment.
arXiv Detail & Related papers (2025-06-28T02:14:43Z) - CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems [38.20651868834145]
We propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-modal multi-agent framework. We show that CAML achieves up to a 58.1% improvement in accident detection. We also validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation.
arXiv Detail & Related papers (2025-02-25T03:59:40Z) - PRIMUS: Pretraining IMU Encoders with Multimodal Self-Supervision [7.896850422430362]
Unlabeled or weakly labeled IMU data can be used to model human motions. We propose PRIMUS: a method for PRetraining IMU encoderS that uses a novel pretraining objective. PRIMUS improves test accuracy by up to 15%, compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-22T18:46:30Z) - M3BAT: Unsupervised Domain Adaptation for Multimodal Mobile Sensing with Multi-Branch Adversarial Training [5.128670847334003]
Multimodal mobile sensing has been used extensively for inferences regarding health and well-being, behavior, and context. However, the distribution of data in the training set often differs from the distribution of data in the real-world deployment environment. We propose M3BAT, an unsupervised domain adaptation method for multimodal mobile sensing with multi-branch adversarial training.
arXiv Detail & Related papers (2024-04-26T13:09:35Z) - Convolutional Monge Mapping Normalization for learning on sleep data [63.22081662149488]
We propose a new method called Convolutional Monge Mapping Normalization (CMMN). CMMN consists of filtering the signals in order to adapt their power spectral density (PSD) to a Wasserstein barycenter estimated on training data. Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent of the neural network architecture.
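The CMMN summary describes a concrete mechanism: filter each signal so its PSD matches a barycenter of the training PSDs. A minimal numpy sketch of that idea follows; it is a simplified reconstruction from the summary, not the authors' implementation (the periodogram PSD estimate, the square-root averaging used as the barycenter, and the per-signal filter `H(f) = sqrt(bary / psd)` are assumptions):

```python
import numpy as np

def cmmn_map(signals):
    """Map each row of `signals` (shape: n_signals x n_samples) so its
    periodogram PSD matches a barycentric PSD of the whole batch.

    For centered stationary signals, a Wasserstein-style barycenter of
    PSDs can be approximated by averaging their square roots."""
    signals = np.asarray(signals, dtype=float)
    n = signals.shape[1]
    spectra = np.fft.rfft(signals, axis=1)
    psds = np.abs(spectra) ** 2 / n          # per-signal periodogram PSD
    bary = np.mean(np.sqrt(psds), axis=0) ** 2  # barycentric PSD
    # Spectral filter that reshapes each signal's PSD onto the barycenter.
    H = np.sqrt(bary / np.maximum(psds, 1e-12))
    mapped = np.fft.irfft(H * spectra, n=n, axis=1)
    return mapped, bary

# Example: three synthetic 128-sample "EEG" channels.
rng = np.random.default_rng(1)
sigs = rng.normal(size=(3, 128))
mapped, bary = cmmn_map(sigs)
```

After the mapping, every signal in the batch shares (up to numerical precision) the same PSD, which is the normalization effect the summary attributes to CMMN.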
arXiv Detail & Related papers (2023-05-30T08:24:01Z) - Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-device edge artificial intelligence (AI) system, which jointly exploits AI-model split inference and integrated sensing and communication (ISAC).
We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z) - Multi-modal Sensor Data Fusion for In-situ Classification of Animal Behavior Using Accelerometry and GNSS Data [16.47484520898938]
We examine using data from multiple sensing modes, i.e., accelerometry and global navigation satellite system (GNSS) data, for classifying animal behavior.
We develop multi-modal animal behavior classification algorithms using two real-world datasets collected via smart cattle collar and ear tags.
arXiv Detail & Related papers (2022-06-24T04:54:03Z) - Federated Deep Learning Meets Autonomous Vehicle Perception: Design and Verification [168.67190934250868]
Federated learning-empowered connected autonomous vehicles (FLCAV) have been proposed.
FLCAV preserves privacy while reducing communication and annotation costs.
It is challenging to determine the network resources and road sensor poses for multi-stage training.
arXiv Detail & Related papers (2022-06-03T23:55:45Z) - Parallel Successive Learning for Dynamic Distributed Model Training over Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [86.03382625531951]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference. It integrates automated data slimming, which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity. Experiments and ablation studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
arXiv Detail & Related papers (2020-01-31T04:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.