MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing
- URL: http://arxiv.org/abs/2511.12305v1
- Date: Sat, 15 Nov 2025 17:35:39 GMT
- Title: MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing
- Authors: Zhizhen Li, Xuanhao Luo, Xueren Ge, Longyu Zhou, Xingqin Lin, Yuchen Liu
- Abstract summary: MMSense is a multi-modal, multi-task foundation model for unified wireless sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment.
- Score: 7.577654996150275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objectives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations, enabling effective cross-modal alignment within a unified feature space. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment and instruction-driven task adaptation. Furthermore, task-specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outperforms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.
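The abstract names two concrete mechanisms without implementation detail: a modality gate that adaptively fuses per-modality features, and uncertainty-based weighting of per-task losses. The sketch below is one minimal PyTorch reading of those ideas, using a softmax gate over projected modality features and the homoscedastic-uncertainty weighting of Kendall et al. (2018); all class names, dimensions, and the exact gating form are assumptions, not MMSense's published architecture.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Softmax gate that adaptively fuses per-modality features.

    A hypothetical reading of a 'modality gating mechanism': each
    modality is first projected into a shared (vision-compatible)
    space, then weighted by a learned, input-dependent gate.
    """
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of [batch, dim] tensors, one per modality
        stacked = torch.stack(feats, dim=1)              # [B, M, D]
        weights = torch.softmax(
            self.gate(stacked.flatten(1)), dim=-1)       # [B, M]
        return (weights.unsqueeze(-1) * stacked).sum(1)  # [B, D]

class UncertaintyWeightedLoss(nn.Module):
    """Multi-task loss with learned homoscedastic uncertainty
    (Kendall et al., 2018): total = sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a free parameter per task."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: list[torch.Tensor]) -> torch.Tensor:
        total = torch.zeros(())
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Toy usage: fuse three modalities, then weight two task losses.
gate = ModalityGate(dim=64, num_modalities=3)
fused = gate([torch.randn(8, 64) for _ in range(3)])
criterion = UncertaintyWeightedLoss(num_tasks=2)
loss = criterion([fused.pow(2).mean(), fused.abs().mean()])
loss.backward()
```

In this form the gate can learn to down-weight an unreliable modality per input, while the learned log-variances let easier tasks take larger effective weight without manual loss-balance tuning.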
Related papers
- A Multi-Modal Foundational Model for Wireless Communication and Sensing [5.101849923596286]
This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems. It learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.
arXiv Detail & Related papers (2026-02-03T21:03:23Z)
- Multimodal Wireless Foundation Models [7.397099215417549]
We build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification).
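The summary does not say how raw IQ relates to "image-like wireless modalities"; a common bridge, shown here purely as an assumed preprocessing step rather than this paper's method, is an STFT magnitude spectrogram that turns a complex IQ stream into a 2-D array a vision-style backbone can consume.

```python
import numpy as np
from scipy.signal import stft

# Hypothetical preprocessing: STFT magnitude spectrogram of a toy
# complex IQ stream (a 50 kHz tone plus noise), yielding an image-like
# 2-D array. Sample rate and FFT size are illustrative choices.
fs = 1e6                                   # sample rate, Hz (assumed)
t = np.arange(4096) / fs
iq = np.exp(2j * np.pi * 50e3 * t)         # complex baseband tone
iq += 0.1 * (np.random.randn(t.size) + 1j * np.random.randn(t.size))

f, seg_t, Z = stft(iq, fs=fs, nperseg=256, return_onesided=False)
spectrogram = 20 * np.log10(np.abs(Z) + 1e-12)  # dB magnitude "image"
print(spectrogram.shape)                        # (freq bins, time frames)
```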
arXiv Detail & Related papers (2025-11-19T06:26:49Z)
- Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks [59.23869884913339]
Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data. We constructed a GPT-style WMLM and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning.
arXiv Detail & Related papers (2025-11-17T07:33:46Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z)
- A Wireless Foundation Model for Multi-Task Prediction [50.21098141769079]
We propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. After training on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and zero-shot performance on new tasks.
arXiv Detail & Related papers (2025-07-08T12:37:55Z)
- Towards a Foundation Model for Communication Systems [16.85529517183343]
In this work, we take a step toward a foundation model for communication data. We propose methodologies to address key challenges, including tokenization, positional embedding, multimodality, variable feature sizes, and normalization. We empirically demonstrate that such a model can successfully estimate multiple features, including transmission rank, selected precoder, Doppler spread, and delay profile.
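The summary lists the challenges (tokenization, positional embedding, variable feature sizes, normalization) but not the paper's solutions. The sketch below illustrates just two of them under generic assumptions: per-feature standardization and padding variable-size feature sets into masked transformer tokens. It is not the paper's pipeline; every name and size is illustrative.

```python
import torch
import torch.nn as nn

def tokenize(features: list[torch.Tensor], proj: nn.Linear):
    """features: per-sample tensors [n_i, feat_dim] with varying n_i.
    Normalizes per feature, pads to a common length, projects to
    transformer tokens, and returns a padding mask."""
    allf = torch.cat(features)                       # [sum n_i, F]
    mean, std = allf.mean(0), allf.std(0) + 1e-6     # per-feature stats
    normed = [(f - mean) / std for f in features]
    padded = nn.utils.rnn.pad_sequence(normed, batch_first=True)  # [B, N, F]
    lengths = torch.tensor([f.size(0) for f in features])
    mask = torch.arange(padded.size(1))[None, :] >= lengths[:, None]
    return proj(padded), mask                        # mask True where padded

proj = nn.Linear(4, 32)
tokens, mask = tokenize([torch.randn(3, 4), torch.randn(5, 4)], proj)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(tokens, src_key_padding_mask=mask)       # [2, 5, 32]
```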
arXiv Detail & Related papers (2025-05-20T16:52:11Z)
- A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning [19.277001743060435]
ContraWiMAE is a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation. Our key innovation is a new wireless-inspired contrastive objective that exploits the inherent characteristics of the wireless environment, including noise, fading, and partial observability, as natural augmentation.
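The stated mechanism, channel impairments as natural augmentations for a contrastive objective, can be made concrete with a standard InfoNCE loss. The sketch below is an illustrative reading, not ContraWiMAE's actual objective: two stochastic views of a channel feature batch are formed by noise, random fading, and masking, and matching pairs are pulled together.

```python
import torch
import torch.nn.functional as F

def wireless_augment(h: torch.Tensor) -> torch.Tensor:
    """One stochastic 'view' of a batch of channel features [B, D]
    via the impairments the abstract lists: noise, fading, masking.
    Magnitudes here are arbitrary illustrative choices."""
    view = h + 0.05 * torch.randn_like(h)            # additive noise
    view = view * (0.5 + torch.rand(h.size(0), 1))   # random per-sample fade
    mask = (torch.rand_like(view) > 0.15).float()    # partial observability
    return view * mask

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """Standard InfoNCE: matching rows of z1/z2 are positives,
    all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # [B, B] similarities
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage with an identity "encoder"; a real model would encode
# masked channel patches with a transformer before this step.
h = torch.randn(16, 128, requires_grad=True)
loss = info_nce(wireless_augment(h), wireless_augment(h))
loss.backward()
```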
arXiv Detail & Related papers (2025-05-14T05:45:22Z)
- Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. We study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
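The implicit manipulation query is described only as an adaptive aggregator of global context within each modality. A generic pattern that fits that description, learned query vectors cross-attending over one modality's tokens, is sketched below; the class name, sizes, and structure are assumptions rather than the paper's actual IMQ design.

```python
import torch
import torch.nn as nn

class LearnableQueryPooling(nn.Module):
    """Learned queries cross-attend over a single modality's tokens
    to aggregate global context into a few summary vectors.
    Illustrative stand-in, not the paper's IMQ."""
    def __init__(self, dim: int, num_queries: int = 4, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, D] features of one modality
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend to tokens
        return pooled                             # [B, num_queries, D]

pool = LearnableQueryPooling(dim=64)
summary = pool(torch.randn(2, 50, 64))            # [2, 4, 64]
```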
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
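For background on the "unified transition matrix": in discrete diffusion (the D3PM family), a one-hot token row vector is corrupted by a Markov transition matrix $Q_t$. The standard forward process below is general background, not necessarily the exact construction this paper unifies across modalities:

```latex
q(x_t \mid x_{t-1}) = \mathrm{Cat}\!\left(x_t;\ x_{t-1} Q_t\right), \qquad
q(x_t \mid x_0) = \mathrm{Cat}\!\left(x_t;\ x_0 \bar{Q}_t\right), \qquad
\bar{Q}_t = Q_1 Q_2 \cdots Q_t .
```

Roughly, unifying modalities then amounts to choosing one family of $Q_t$ (e.g., uniform or absorbing-state transitions) shared across the token vocabularies of text and images.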
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multi-task Learning Approach for Modulation and Wireless Signal Classification for 5G and Beyond: Edge Deployment via Model Compression [1.218340575383456]
Future communication networks must address spectrum scarcity to accommodate the growth of heterogeneous wireless devices.
We exploit the potential of a deep neural network-based multi-task learning framework to simultaneously learn modulation and wireless signal classification tasks.
We provide a comprehensive heterogeneous wireless signals dataset for public use.
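The entry's mechanism, one network jointly learning modulation and signal classification, is commonly realized with hard parameter sharing. The sketch below shows that generic pattern over raw I/Q input; layer sizes, class counts, and the backbone itself are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Hard parameter sharing: a shared 1-D conv backbone over a
    2-channel (I/Q) signal, with separate classification heads for
    modulation type and signal type. Sizes are illustrative."""
    def __init__(self, num_mods: int = 11, num_sigs: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.mod_head = nn.Linear(64, num_mods)
        self.sig_head = nn.Linear(64, num_sigs)

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)               # x: [B, 2, T] raw I/Q
        return self.mod_head(z), self.sig_head(z)

# Toy usage: one forward pass, losses summed across both tasks.
net = TwoHeadNet()
mod_logits, sig_logits = net(torch.randn(4, 2, 1024))
loss = (nn.functional.cross_entropy(mod_logits, torch.randint(0, 11, (4,)))
        + nn.functional.cross_entropy(sig_logits, torch.randint(0, 8, (4,))))
loss.backward()
```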
arXiv Detail & Related papers (2022-02-26T14:51:02Z)