Related papers: X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing

URL: http://arxiv.org/abs/2410.10167v2
Date: Fri, 18 Oct 2024 06:57:51 GMT
Title: X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
Authors: Xinyan Chen, Jianfei Yang
Abstract summary: Current human sensing primarily depends on cameras and LiDAR, each of which has its own strengths and limitations. Existing multi-modal fusion solutions are typically designed for fixed modality combinations. We propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue.
Score: 14.549639729808717
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiDAR, each of which has its own strengths and limitations. Furthermore, existing multi-modal fusion solutions are typically designed for fixed modality combinations, requiring extensive retraining when modalities are added or removed for diverse scenarios. In this paper, we propose a modality-invariant foundation model for all modalities, X-Fi, to address this issue. X-Fi enables the independent or combinatory use of sensor modalities without additional training by utilizing a transformer structure to accommodate variable input sizes and incorporating a novel "X-fusion" mechanism to preserve modality-specific features during multimodal integration. This approach not only enhances adaptability but also facilitates the learning of complementary features across modalities. Extensive experiments conducted on the MM-Fi and XRF55 datasets, employing six distinct modalities, demonstrate that X-Fi achieves state-of-the-art performance in human pose estimation (HPE) and human activity recognition (HAR) tasks. The findings indicate that our proposed model can efficiently support a wide range of human sensing applications, ultimately contributing to the evolution of scalable, multimodal sensing technologies.
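The abstract's point that a transformer structure can accommodate variable input sizes can be illustrated with a small sketch. This is not the paper's "X-fusion" mechanism, whose details are not given here; it is plain cross-attention pooling, in which a fixed set of query vectors attends over however many modality tokens are present. The function names, modality names, and dimensions are all hypothetical, and the weights are random rather than trained.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(modality_tokens, queries, w_k, w_v):
    """Cross-attention pooling over whichever modalities are supplied.

    modality_tokens: dict mapping a (hypothetical) modality name to a list of
    d-dimensional token vectors. The output always has len(queries) rows, so
    its shape does not depend on how many modalities are present.
    """
    tokens = [tok for feats in modality_tokens.values() for tok in feats]
    keys = matmul(tokens, w_k)
    values = matmul(tokens, w_v)
    scale = math.sqrt(len(queries[0]))
    # One attention distribution per query, over all available tokens.
    weights = [
        softmax([sum(q_i * k_i for q_i, k_i in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    return matmul(weights, values)

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d = 8
queries = rand_matrix(rng, 4, d)   # stand-in for learned query tokens
w_k = rand_matrix(rng, d, d)       # stand-in for learned key projection
w_v = rand_matrix(rng, d, d)       # stand-in for learned value projection

rgb = rand_matrix(rng, 5, d)       # hypothetical per-modality token sets
lidar = rand_matrix(rng, 3, d)
wifi = rand_matrix(rng, 2, d)

out_one = fuse({"rgb": rgb}, queries, w_k, w_v)
out_all = fuse({"rgb": rgb, "lidar": lidar, "wifi": wifi}, queries, w_k, w_v)
print(len(out_one), len(out_one[0]), len(out_all), len(out_all[0]))
```

Because attention sums over the available tokens, the same weights handle one modality or several, and the result does not depend on the order in which modalities are supplied. Preserving modality-specific features, which the abstract attributes to X-fusion, would need machinery beyond this sketch.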

Related papers

XTransfer: Cross-Modality Model Transfer for Human Sensing with Few Data at the Edge [32.69565269313996]
Current methods that rely on transferring pre-trained models often encounter issues such as modality shift. We propose XTransfer, a first-of-its-kind method for resource-efficient, modality-agnostic model transfer. XTransfer achieves state-of-the-art performance on human sensing tasks while significantly reducing the costs of sensor data collection, model training, and edge deployment.
arXiv Detail & Related papers (2025-06-28T02:14:43Z)
Towards Robust Multimodal Physiological Foundation Models: Handling Arbitrary Missing Modalities [9.785262633953794]
Physio Omni is a foundation model for multimodal physiological signal analysis. It trains a decoupled multimodal tokenizer, enabling masked signal pre-training. It achieves state-of-the-art performance while maintaining strong robustness to missing modalities.
arXiv Detail & Related papers (2025-04-28T09:00:04Z)
X-Capture: An Open-Source Portable Device for Multi-Sensory Learning [11.632896115888261]
We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets. X-Capture lays the groundwork for advancing human-like sensory representations in AI.
arXiv Detail & Related papers (2025-04-03T06:44:25Z)
AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation [2.8498944632323755]
In clinical practice, full imaging is not always feasible, often due to complex acquisition protocols, stringent privacy regulations, or specific clinical needs. A promising solution is missing data imputation, where absent modalities are generated from available ones. We propose an Adaptive Multi-Modality Diffusion Network (AMM-Diff), a novel diffusion-based generative model capable of handling any number of input modalities and generating the missing ones.
arXiv Detail & Related papers (2025-01-22T12:29:33Z)
AdaptiveFusion: Adaptive Multi-Modal Multi-View Fusion for 3D Human Body Reconstruction [15.18875378385477]
We propose AdaptiveFusion, a generic adaptive multi-modal multi-view fusion framework. Our method achieves superior accuracy compared to state-of-the-art fusion methods.
arXiv Detail & Related papers (2024-09-07T15:06:30Z)
Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration [37.456185990843515]
We introduce a Universal On-Device Multi-modal Model Adaptation Framework. The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices. Our contributions represent a pioneering solution for on-Device Multi-modal Model Adaptation (DMMA).
arXiv Detail & Related papers (2024-05-21T14:42:18Z)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications. Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders. We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent. Our method mitigates the performance loss, reducing it from its original $\sim 30\%$ drop to only $\sim 10\%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks. Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment. We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
Source-free Domain Adaptation Requires Penalized Diversity [60.04618512479438]
Source-free domain adaptation (SFDA) was introduced to address knowledge transfer between different domains in the absence of source data. In unsupervised SFDA, the diversity is limited to learning a single hypothesis on the source or learning multiple hypotheses with a shared feature extractor. We propose a novel unsupervised SFDA algorithm that promotes representational diversity through the use of separate feature extractors.
arXiv Detail & Related papers (2023-04-06T00:20:19Z)
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning [112.51498431119616]
This paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. A single model, HighMMT, scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas.
arXiv Detail & Related papers (2022-03-02T18:56:20Z)
Invariant Feature Learning for Sensor-based Human Activity Recognition [11.334750079923428]
We present an invariant feature learning framework (IFLF) that extracts common information shared across subjects and devices. Experiments demonstrated that IFLF is effective in handling both subject and device diversion across popular open datasets and an in-house dataset.
arXiv Detail & Related papers (2020-12-14T21:56:17Z)
SensiX: A Platform for Collaborative Machine Learning on the Edge [69.1412199244903]
We present SensiX, a personal edge platform that stays between sensor data and sensing models. We demonstrate its efficacy in developing motion and audio-based multi-device sensing systems. Our evaluation shows that SensiX offers a 7-13% increase in overall accuracy and up to 30% increase across different environment dynamics at the expense of 3mW power overhead.
arXiv Detail & Related papers (2020-12-04T23:06:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.