Multimodal Wireless Foundation Models
- URL: http://arxiv.org/abs/2511.15162v1
- Date: Wed, 19 Nov 2025 06:26:49 GMT
- Title: Multimodal Wireless Foundation Models
- Authors: Ahmed Aboulfotouh, Hatem Abou-Zeid
- Abstract summary: We build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification).
- Score: 7.397099215417549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, current WFMs process only one modality, even though the most informative modality changes with the task and operating conditions, and no single modality is best for all tasks. WFMs should therefore be designed to accept multiple modalities to enable a broader and more diverse range of tasks and scenarios. In this work, we propose and build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities (e.g., spectrograms and CSI) and performing multiple tasks across both. We introduce masked wireless modeling for the multimodal setting, a self-supervised objective and pretraining recipe that learns a joint representation from IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). The multimodal WFM is competitive with single-modality WFMs and in several cases surpasses their performance. Our results demonstrate the strong potential of developing multimodal WFMs that support diverse wireless tasks across different modalities. We believe this provides a concrete step toward both AI-native 6G and the vision of joint sensing, communication, and localization.
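To make the masked wireless modeling idea concrete, the sketch below shows one plausible way such multimodal pretraining could be set up; it is an illustrative assumption, not the authors' implementation. IQ streams and image-like inputs (spectrograms/CSI) are patchified by modality-specific tokenizers, a fraction of tokens is replaced by a learned mask token, a shared Transformer encodes the sequence, and masked patches are reconstructed. All dimensions, patch sizes, and the mask ratio are made up for illustration.

```python
# Hedged sketch of multimodal masked wireless modeling (assumed dims/ratios).
import torch
import torch.nn as nn


class MultimodalMaskedWFM(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, iq_patch=32, img_patch=16, mask_ratio=0.6):
        super().__init__()
        self.mask_ratio = mask_ratio
        # Modality-specific tokenizers mapping patches into a shared embedding space.
        self.iq_embed = nn.Linear(2 * iq_patch, dim)             # IQ: 2 channels (I, Q) x patch length
        self.img_embed = nn.Linear(img_patch * img_patch, dim)   # spectrogram/CSI: square patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)   # shared backbone across modalities
        # Modality-specific reconstruction heads for the self-supervised objective.
        self.iq_head = nn.Linear(dim, 2 * iq_patch)
        self.img_head = nn.Linear(dim, img_patch * img_patch)
        self.iq_patch, self.img_patch = iq_patch, img_patch

    def patchify_iq(self, x):                  # x: (B, 2, L) raw IQ stream
        B, C, L = x.shape
        p = self.iq_patch
        return x.reshape(B, C, L // p, p).permute(0, 2, 1, 3).reshape(B, L // p, C * p)

    def patchify_img(self, x):                 # x: (B, H, W) spectrogram or CSI map
        B, H, W = x.shape
        p = self.img_patch
        x = x.reshape(B, H // p, p, W // p, p).permute(0, 1, 3, 2, 4)
        return x.reshape(B, (H // p) * (W // p), p * p)

    def forward(self, x, modality):
        # 1) Patchify and embed with the modality-specific tokenizer.
        if modality == "iq":
            patches, embed, head = self.patchify_iq(x), self.iq_embed, self.iq_head
        else:
            patches, embed, head = self.patchify_img(x), self.img_embed, self.img_head
        tokens = embed(patches)                # (B, N, dim)
        B, N, D = tokens.shape
        # 2) Randomly mask a fraction of tokens, replacing them with a learned mask token.
        mask = torch.rand(B, N, device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), tokens)
        # 3) Encode with the shared backbone and reconstruct only the masked patches.
        recon = head(self.encoder(tokens))
        return ((recon - patches) ** 2)[mask].mean()


model = MultimodalMaskedWFM()
loss_iq = model(torch.randn(4, 2, 1024), "iq")    # raw IQ batch
loss_img = model(torch.randn(4, 64, 64), "img")   # spectrogram/CSI batch
(loss_iq + loss_img).backward()
```

A positional encoding and a real masking/shuffling scheme (as in MAE-style pretraining) would be needed in practice; the sketch only shows how a single backbone can be shared across the two modality families.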
Related papers
- Multi-Modal Data-Enhanced Foundation Models for Prediction and Control in Wireless Networks: A Survey [9.762879334040566]
Foundation models (FMs) are recognized as a transformative breakthrough that has started to reshape the future of artificial intelligence (AI). This work discusses the utilization of FMs, especially multi-modal FMs, in wireless networks.
arXiv Detail & Related papers (2026-01-06T16:59:29Z)
- Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks [59.23869884913339]
Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data. We constructed a GPT-style WMLM model and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning (an illustrative contrastive sketch follows below).
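The following sketch illustrates the general idea of anchoring contrastive learning on the wireless-signal embedding; it is a hedged assumption, not this paper's implementation, and the function names, dimensions, and temperature are invented for illustration.

```python
# InfoNCE-style loss with the wireless embedding as the anchor modality (assumed API).
import torch
import torch.nn.functional as F

def anchor_contrastive_loss(wireless_emb, other_emb, temperature=0.07):
    """Matched (wireless, other-modality) pairs are positives; all other
    pairings within the batch act as negatives."""
    w = F.normalize(wireless_emb, dim=-1)        # (B, D) anchor modality
    o = F.normalize(other_emb, dim=-1)           # (B, D) paired modality
    logits = w @ o.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(w.size(0), device=w.device)
    # Symmetric cross-entropy, as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = anchor_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```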
arXiv Detail & Related papers (2025-11-17T07:33:46Z)
- MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing [7.577654996150275]
MMSense is a multi-modal, multi-task foundation model for unified wireless sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment (a toy gating sketch is given below).
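A modality gating mechanism of this kind could look roughly like the sketch below; it is an assumption for illustration only (not MMSense's code), with invented dimensions: each modality is projected into a shared space and a learned gate produces per-sample fusion weights.

```python
# Toy adaptive modality-gated fusion over heterogeneous feature vectors.
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    def __init__(self, dims, shared_dim=256):
        super().__init__()
        # One projection per modality (e.g., image, radar, LiDAR, text).
        self.proj = nn.ModuleList([nn.Linear(d, shared_dim) for d in dims])
        self.gate = nn.Linear(shared_dim * len(dims), len(dims))

    def forward(self, feats):                                   # list of (B, d_i) tensors
        shared = [p(f) for p, f in zip(self.proj, feats)]       # each (B, shared_dim)
        weights = torch.softmax(self.gate(torch.cat(shared, dim=-1)), dim=-1)  # (B, M)
        stacked = torch.stack(shared, dim=1)                    # (B, M, shared_dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)     # fused (B, shared_dim)

fusion = ModalityGatedFusion(dims=[512, 128, 96, 768])
fused = fusion([torch.randn(4, d) for d in (512, 128, 96, 768)])
```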
arXiv Detail & Related papers (2025-11-15T17:35:39Z)
- Hierarchical Federated Foundation Models over Wireless Networks for Multi-Modal Multi-Task Intelligence: Integration of Edge Learning with D2D/P2P-Enabled Fog Learning Architectures [58.72593025539547]
In this paper, we unveil an unexplored variation of M3T FFMs by proposing hierarchical federated foundation models (HF-FMs). HF-FMs strategically align the modular structure of M3T FMs, comprising modality encoders, prompts, mixture-of-experts (MoEs), adapters, and task heads. To demonstrate their potential, we prototype HF-FMs in a wireless network setting and release the open-source code for the development of HF-FMs (a toy two-level aggregation sketch follows below).
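As a hedged illustration of what "hierarchical" federation over such modules could mean, the toy code below averages device-held adapter weights within a D2D/fog cluster first and then across clusters; the two-level scheme, names, and tensors are assumptions, not the paper's method.

```python
# Two-level (device -> fog cluster -> edge/cloud) FedAvg over adapter parameters.
import torch

def fedavg(states):
    """Average a list of state_dicts that share the same keys."""
    return {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}

# Three fog clusters, each holding a few device-local adapter states (toy tensors).
clusters = [[{"adapter.weight": torch.randn(4, 4)} for _ in range(3)] for _ in range(3)]
cluster_models = [fedavg(devices) for devices in clusters]   # intra-cluster (D2D/fog) round
global_model = fedavg(cluster_models)                         # inter-cluster (edge/cloud) round
```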
arXiv Detail & Related papers (2025-09-03T20:23:19Z)
- Chain-of-Focus: Adaptive Visual Search and Zooming for Multimodal Reasoning via RL [70.1326027641056]
Vision language models (VLMs) have achieved impressive performance across a variety of computer vision tasks. We propose a Chain-of-Focus (CoF) method that allows VLMs to perform adaptive focusing and zooming in on key image regions. We present a two-stage training pipeline, including supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-21T12:18:15Z)
- AI2MMUM: AI-AI Oriented Multi-Modal Universal Model Leveraging Telecom Domain Large Model [8.404195378257178]
We propose a scalable, task-aware artificial intelligence-air interface multi-modal universal model (AI2MMUM). To enhance task adaptability, task instructions consist of fixed task keywords and learnable, implicit prefix prompts. Lightweight task-specific heads are designed to directly output task objectives (a hedged prefix-prompt sketch follows below).
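The sketch below shows one way the "fixed task keyword + learnable prefix prompt + lightweight head" pattern can be wired up; it is an assumption with invented vocabulary size, dimensions, and head, not AI2MMUM's architecture.

```python
# Learnable prefix prompt prepended to embedded task keywords, with a small task head.
import torch
import torch.nn as nn

class PrefixPromptedTask(nn.Module):
    def __init__(self, vocab_size=1000, dim=256, prefix_len=8, num_classes=10):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)                  # fixed task keywords
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)   # learnable, implicit prefix prompt
        layer = nn.TransformerEncoderLayer(dim, 8, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, 2)
        self.head = nn.Linear(dim, num_classes)                           # lightweight task-specific head

    def forward(self, keyword_ids):                                       # (B, T) token ids
        B = keyword_ids.size(0)
        tokens = self.token_embed(keyword_ids)                            # (B, T, dim)
        prefix = self.prefix.unsqueeze(0).expand(B, -1, -1)               # (B, P, dim)
        h = self.backbone(torch.cat([prefix, tokens], dim=1))             # prompt prepended to instruction
        return self.head(h.mean(dim=1))                                   # directly output task objective

model = PrefixPromptedTask()
logits = model(torch.randint(0, 1000, (4, 6)))
```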
arXiv Detail & Related papers (2025-05-15T06:32:59Z)
- 6G WavesFM: A Foundation Model for Sensing, Communication, and Localization [6.70088826174291]
This paper introduces a novel Wireless Foundation Model (WFM) framework, capable of supporting a wide array of communication, sensing, and localization tasks. Our proposed architecture combines a shared Vision Transformer (ViT) backbone with task-specific multi-layer perceptron heads and incorporates Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We show that our unified WFM can support diverse tasks and deliver significant gains in both performance and efficiency (a minimal LoRA sketch is given below).
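For readers unfamiliar with LoRA-style parameter-efficient fine-tuning, the minimal sketch below freezes a pretrained linear layer (as would sit inside a shared ViT backbone) and learns only a low-rank update, with a small task-specific MLP head on top. Dimensions, rank, and scaling are illustrative assumptions, not the WavesFM configuration.

```python
# Frozen linear layer plus trainable low-rank update (LoRA), with a task head.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)                    # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(out_dim, rank))         # low-rank up-projection (init 0)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

backbone_proj = LoRALinear(768, 768)                               # e.g., one ViT projection layer
task_head = nn.Sequential(nn.Linear(768, 256), nn.GELU(), nn.Linear(256, 10))
logits = task_head(backbone_proj(torch.randn(4, 768)))
```

Only the low-rank factors and the task head carry gradients, which is what makes per-task adaptation of a shared backbone cheap.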
arXiv Detail & Related papers (2025-04-18T22:51:35Z)
- MMGen: Unified Multi-modal Image Generation and Understanding in One Go [60.97155790727879]
We introduce MMGen, a unified framework that integrates multiple generative tasks into a single diffusion model. Our approach develops a novel diffusion transformer that flexibly supports multi-modal output, along with a simple modality-decoupling strategy.
arXiv Detail & Related papers (2025-03-26T15:37:17Z)
- Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z)
- FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that a single model trained with FM-ViT can not only flexibly evaluate different modal samples but also outperform existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)