MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing
- URL: http://arxiv.org/abs/2511.12305v1
- Date: Sat, 15 Nov 2025 17:35:39 GMT
- Title: MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing
- Authors: Zhizhen Li, Xuanhao Luo, Xueren Ge, Longyu Zhou, Xingqin Lin, Yuchen Liu
- Abstract summary: MMSense is a multi-modal, multi-task foundation model for unified wireless sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment.
- Score: 7.577654996150275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objectives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations, enabling effective cross-modal alignment within a unified feature space. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment and instruction-driven task adaptation. Furthermore, task-specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outperforms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.
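The abstract names two concrete mechanisms without implementation detail: a modality gate that adaptively fuses per-modality features, and uncertainty-based weighting of per-task losses. The sketch below is one minimal PyTorch reading of those ideas, using a softmax gate over projected modality features and the homoscedastic-uncertainty weighting of Kendall et al. (2018); all class names, dimensions, and the exact gating form are assumptions, not MMSense's published architecture.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Softmax gate that adaptively fuses per-modality features.

    A hypothetical reading of a 'modality gating mechanism': each
    modality is first projected into a shared (vision-compatible)
    space, then weighted by a learned, input-dependent gate.
    """
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.gate = nn.Linear(dim * num_modalities, num_modalities)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of [batch, dim] tensors, one per modality
        stacked = torch.stack(feats, dim=1)              # [B, M, D]
        weights = torch.softmax(
            self.gate(stacked.flatten(1)), dim=-1)       # [B, M]
        return (weights.unsqueeze(-1) * stacked).sum(1)  # [B, D]

class UncertaintyWeightedLoss(nn.Module):
    """Multi-task loss with learned homoscedastic uncertainty
    (Kendall et al., 2018): total = sum_i exp(-s_i) * L_i + s_i,
    where s_i = log(sigma_i^2) is a free parameter per task."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: list[torch.Tensor]) -> torch.Tensor:
        total = torch.zeros(())
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total

# Toy usage: fuse three modalities, then weight two task losses.
gate = ModalityGate(dim=64, num_modalities=3)
fused = gate([torch.randn(8, 64) for _ in range(3)])
criterion = UncertaintyWeightedLoss(num_tasks=2)
loss = criterion([fused.pow(2).mean(), fused.abs().mean()])
loss.backward()
```

In this form the gate can learn to down-weight an unreliable modality per input, while the learned log-variances let easier tasks take larger effective weight without manual loss-balance tuning.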
Related papers
- A Multi-Modal Foundational Model for Wireless Communication and Sensing [5.101849923596286]
This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems. It learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.
arXiv Detail & Related papers (2026-02-03T21:03:23Z)
- Multimodal Wireless Foundation Models [7.397099215417549]
We build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification).
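The summary does not say how raw IQ relates to "image-like wireless modalities"; a common bridge, shown here purely as an assumed preprocessing step rather than this paper's method, is an STFT magnitude spectrogram that turns a complex IQ stream into a 2-D array a vision-style backbone can consume.

```python
import numpy as np
from scipy.signal import stft

# Hypothetical preprocessing: STFT magnitude spectrogram of a toy
# complex IQ stream (a 50 kHz tone plus noise), yielding an image-like
# 2-D array. Sample rate and FFT size are illustrative choices.
fs = 1e6                                   # sample rate, Hz (assumed)
t = np.arange(4096) / fs
iq = np.exp(2j * np.pi * 50e3 * t)         # complex baseband tone
iq += 0.1 * (np.random.randn(t.size) + 1j * np.random.randn(t.size))

f, seg_t, Z = stft(iq, fs=fs, nperseg=256, return_onesided=False)
spectrogram = 20 * np.log10(np.abs(Z) + 1e-12)  # dB magnitude "image"
print(spectrogram.shape)                        # (freq bins, time frames)
```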
arXiv Detail & Related papers (2025-11-19T06:26:49Z)
- Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks [59.23869884913339]
Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data. We constructed a GPT-style WMLM and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning.
arXiv Detail & Related papers (2025-11-17T07:33:46Z)
- NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching [64.10695425442164]
We introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal generation and understanding benchmarks. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
arXiv Detail & Related papers (2025-10-15T16:25:18Z)
- Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z)
- A Wireless Foundation Model for Multi-Task Prediction [50.21098141769079]
We propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. After training on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and zero-shot performance on new tasks.
arXiv Detail & Related papers (2025-07-08T12:37:55Z)
- Towards a Foundation Model for Communication Systems [16.85529517183343]
In this work, we take a step toward a foundation model for communication data. We propose methodologies to address key challenges, including tokenization, positional embedding, multimodality, variable feature sizes, and normalization. We empirically demonstrate that such a model can successfully estimate multiple features, including transmission rank, selected precoder, Doppler spread, and delay profile.
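The summary lists the challenges (tokenization, positional embedding, variable feature sizes, normalization) but not the paper's solutions. The sketch below illustrates just two of them under generic assumptions: per-feature standardization and padding variable-size feature sets into masked transformer tokens. It is not the paper's pipeline; every name and size is illustrative.

```python
import torch
import torch.nn as nn

def tokenize(features: list[torch.Tensor], proj: nn.Linear):
    """features: per-sample tensors [n_i, feat_dim] with varying n_i.
    Normalizes per feature, pads to a common length, projects to
    transformer tokens, and returns a padding mask."""
    allf = torch.cat(features)                       # [sum n_i, F]
    mean, std = allf.mean(0), allf.std(0) + 1e-6     # per-feature stats
    normed = [(f - mean) / std for f in features]
    padded = nn.utils.rnn.pad_sequence(normed, batch_first=True)  # [B, N, F]
    lengths = torch.tensor([f.size(0) for f in features])
    mask = torch.arange(padded.size(1))[None, :] >= lengths[:, None]
    return proj(padded), mask                        # mask True where padded

proj = nn.Linear(4, 32)
tokens, mask = tokenize([torch.randn(3, 4), torch.randn(5, 4)], proj)
layer = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
out = layer(tokens, src_key_padding_mask=mask)       # [2, 5, 32]
```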
arXiv Detail & Related papers (2025-05-20T16:52:11Z)
- A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning [19.277001743060435]
ContraWiMAE is a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation. Our key innovation is a new wireless-inspired contrastive objective that exploits the inherent characteristics of the wireless environment, including noise, fading, and partial observability, as natural augmentation.
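The stated mechanism, channel impairments as natural augmentations for a contrastive objective, can be made concrete with a standard InfoNCE loss. The sketch below is an illustrative reading, not ContraWiMAE's actual objective: two stochastic views of a channel feature batch are formed by noise, random fading, and masking, and matching pairs are pulled together.

```python
import torch
import torch.nn.functional as F

def wireless_augment(h: torch.Tensor) -> torch.Tensor:
    """One stochastic 'view' of a batch of channel features [B, D]
    via the impairments the abstract lists: noise, fading, masking.
    Magnitudes here are arbitrary illustrative choices."""
    view = h + 0.05 * torch.randn_like(h)            # additive noise
    view = view * (0.5 + torch.rand(h.size(0), 1))   # random per-sample fade
    mask = (torch.rand_like(view) > 0.15).float()    # partial observability
    return view * mask

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    """Standard InfoNCE: matching rows of z1/z2 are positives,
    all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                       # [B, B] similarities
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(logits, labels)

# Toy usage with an identity "encoder"; a real model would encode
# masked channel patches with a transformer before this step.
h = torch.randn(16, 128, requires_grad=True)
loss = info_nce(wireless_augment(h), wireless_augment(h))
loss.backward()
```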
arXiv Detail & Related papers (2025-05-14T05:45:22Z)
- Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks. We study the cross-domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
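The implicit manipulation query is described only as an adaptive aggregator of global context within each modality. A generic pattern that fits that description, learned query vectors cross-attending over one modality's tokens, is sketched below; the class name, sizes, and structure are assumptions rather than the paper's actual IMQ design.

```python
import torch
import torch.nn as nn

class LearnableQueryPooling(nn.Module):
    """Learned queries cross-attend over a single modality's tokens
    to aggregate global context into a few summary vectors.
    Illustrative stand-in, not the paper's IMQ."""
    def __init__(self, dim: int, num_queries: int = 4, heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, D] features of one modality
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # queries attend to tokens
        return pooled                             # [B, num_queries, D]

pool = LearnableQueryPooling(dim=64)
summary = pool(torch.randn(2, 50, 64))            # [2, 4, 64]
```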
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
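For background on the "unified transition matrix": in discrete diffusion (the D3PM family), a one-hot token row vector is corrupted by a Markov transition matrix $Q_t$. The standard forward process below is general background, not necessarily the exact construction this paper unifies across modalities:

```latex
q(x_t \mid x_{t-1}) = \mathrm{Cat}\!\left(x_t;\ x_{t-1} Q_t\right), \qquad
q(x_t \mid x_0) = \mathrm{Cat}\!\left(x_t;\ x_0 \bar{Q}_t\right), \qquad
\bar{Q}_t = Q_1 Q_2 \cdots Q_t .
```

Roughly, unifying modalities then amounts to choosing one family of $Q_t$ (e.g., uniform or absorbing-state transitions) shared across the token vocabularies of text and images.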
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
- Multi-task Learning Approach for Modulation and Wireless Signal Classification for 5G and Beyond: Edge Deployment via Model Compression [1.218340575383456]
Future communication networks must address spectrum scarcity to accommodate the growth of heterogeneous wireless devices.
We exploit the potential of a deep neural network-based multi-task learning framework to simultaneously learn modulation and wireless signal classification tasks.
We provide a comprehensive heterogeneous wireless signals dataset for public use.
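The entry's mechanism, one network jointly learning modulation and signal classification, is commonly realized with hard parameter sharing. The sketch below shows that generic pattern over raw I/Q input; layer sizes, class counts, and the backbone itself are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    """Hard parameter sharing: a shared 1-D conv backbone over a
    2-channel (I/Q) signal, with separate classification heads for
    modulation type and signal type. Sizes are illustrative."""
    def __init__(self, num_mods: int = 11, num_sigs: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.mod_head = nn.Linear(64, num_mods)
        self.sig_head = nn.Linear(64, num_sigs)

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)               # x: [B, 2, T] raw I/Q
        return self.mod_head(z), self.sig_head(z)

# Toy usage: one forward pass, losses summed across both tasks.
net = TwoHeadNet()
mod_logits, sig_logits = net(torch.randn(4, 2, 1024))
loss = (nn.functional.cross_entropy(mod_logits, torch.randint(0, 11, (4,)))
        + nn.functional.cross_entropy(sig_logits, torch.randint(0, 8, (4,))))
loss.backward()
```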
arXiv Detail & Related papers (2022-02-26T14:51:02Z)