Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks
- URL: http://arxiv.org/abs/2511.21707v1
- Date: Mon, 17 Nov 2025 07:33:46 GMT
- Title: Sensing and Understanding the World over Air: A Large Multimodal Model for Mobile Networks
- Authors: Zhuoran Duan, Yuhao Wei, Guoshun Nan, Zijun Wang, Yan Yan, Lihua Xiong, Yuhan Ran, Ji Zhang, Jian Li, Qimei Cui, Xiaofeng Tao, Tony Q. S. Quek
- Abstract summary: Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data. We constructed a GPT-style WMLM and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning.
- Score: 59.23869884913339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large models (LMs), such as ChatGPT, have made a significant impact across diverse domains and hold great potential to facilitate the evolution of network intelligence. Wireless-native multi-modal large models (WMLMs) can sense and understand the physical world through multi-modal data, serving as a key enabler that integrates communication, sensing, and intelligence; they can thus boost various smart services for billions of users. However, research on WMLMs remains in its infancy, and the construction of domain-specific multi-modal large models for wireless networks is still underexplored. In this paper, we outline the key characteristics of WMLMs and summarize existing methods, on the basis of which a wireless-native multimodal training paradigm is proposed. Specifically, we constructed a GPT-style WMLM and trained it on a real-world large-scale dataset, leveraging wireless signals as an anchor modality for contrastive learning. Our approach demonstrates outstanding performance compared with existing small-scale models and large multi-modal models, validating the feasibility of using wireless signals as a universal modality and highlighting WMLMs' potential to emerge as a new paradigm for future wireless networks.
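The core of the proposed training paradigm is CLIP-style contrastive alignment in which wireless signals serve as the anchor modality that every other modality is aligned against. Below is a minimal sketch of that idea in PyTorch; the function names, temperature value, and batch-diagonal positive pairing are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def anchor_contrastive_loss(wireless_emb, other_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning another modality to wireless-signal
    embeddings (the anchor). Row i of each tensor is assumed to come from
    the same scene / time window, so positives lie on the diagonal."""
    w = F.normalize(wireless_emb, dim=-1)      # cosine-normalize embeddings
    o = F.normalize(other_emb, dim=-1)
    logits = w @ o.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(w.size(0), device=w.device)
    # Average the wireless->other and other->wireless directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: pull each non-anchor modality toward the wireless-anchored space.
csi_emb, img_emb, txt_emb = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
loss = anchor_contrastive_loss(csi_emb, img_emb) + anchor_contrastive_loss(csi_emb, txt_emb)
```

Anchoring all modalities to the wireless embedding space means pairs such as image-text need never be observed together; alignment to the anchor is enough to relate them indirectly.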
Related papers
- MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing [7.577654996150275]
MMSense is a multi-modal, multi-task foundation model for unified wireless sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment.
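As a rough, hypothetical illustration of such a gating mechanism (layer sizes and softmax gating are assumptions, not MMSense's actual design), per-modality gates predicted from the concatenated features could weight each vision-compatible representation before fusion:

```python
import torch
import torch.nn as nn

class ModalityGatedFusion(nn.Module):
    """Sketch: scalar gates, predicted from concatenated modality features,
    adaptively weight each modality's representation before summing."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * num_modalities, num_modalities),
            nn.Softmax(dim=-1),                      # gates sum to 1
        )

    def forward(self, feats):                        # list of (batch, dim)
        gates = self.gate(torch.cat(feats, dim=-1))  # (batch, M)
        stacked = torch.stack(feats, dim=1)          # (batch, M, dim)
        return (gates.unsqueeze(-1) * stacked).sum(dim=1)

fuse = ModalityGatedFusion(dim=768, num_modalities=4)
img, radar, lidar, text = (torch.randn(2, 768) for _ in range(4))
fused = fuse([img, radar, lidar, text])              # (2, 768)
```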
arXiv Detail & Related papers (2025-11-15T17:35:39Z)
- Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges [31.57528074626831]
Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthroughs. This article focuses on task-oriented autonomous communications with LLMs/LMMs. We show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques.
arXiv Detail & Related papers (2025-10-23T15:08:58Z)
- Large Language Models for Wireless Communications: From Adaptation to Autonomy [47.40285060307752]
Large language models (LLMs) offer unprecedented capabilities in reasoning, generalization, and zero-shot learning. This article explores the role of LLMs in transforming wireless systems across three key directions.
arXiv Detail & Related papers (2025-07-29T06:21:10Z)
- NetOrchLLM: Mastering Wireless Network Orchestration with Large Language Models [11.015852090523229]
Large language models (LLMs) have revolutionized various domains by leveraging their sophisticated natural language understanding capabilities. This paper presents NETORCHLLM, a wireless NETwork ORCHestrator LLM framework that seamlessly orchestrates diverse wireless-specific models. A comprehensive framework is introduced, demonstrating the practical viability of our approach.
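Purely as a hypothetical sketch of what LLM-driven orchestration of wireless-specific models could look like (all tool names and the stubbed LLM call are placeholders, not the NETORCHLLM design):

```python
def run_beamforming_model(req: str) -> str:
    return f"[beamforming model] handled: {req}"        # stub specialist model

def run_channel_predictor(req: str) -> str:
    return f"[channel predictor] handled: {req}"        # stub specialist model

TOOLS = {"beamforming": run_beamforming_model,
         "channel_prediction": run_channel_predictor}

def ask_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call that returns a chosen tool name.
    return "beamforming" if "beam" in prompt.lower() else "channel_prediction"

def orchestrate(request: str) -> str:
    prompt = ("Pick one tool for this request. Tools: "
              + ", ".join(TOOLS) + f"\nRequest: {request}")
    tool = ask_llm(prompt)
    return TOOLS.get(tool, run_channel_predictor)(request)

print(orchestrate("optimize beam directions for cell 12"))
```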
arXiv Detail & Related papers (2024-12-13T12:48:15Z)
- Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives [56.2139730920855]
We present a systematic analysis of MM-VUFMs specifically designed for road scenes.
Our objective is to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques.
We provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models.
arXiv Detail & Related papers (2024-02-05T12:47:09Z)
- MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [63.78384552789171]
This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel multi-modal fine-tuning paradigm.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
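One plausible reading of "visual-guided textual features" is cross-attention in which text tokens attend to image features (and vice versa for text-guided visual features); the toy sketch below, with assumed dimensions, illustrates only that generic operation, not MMICT's actual M-Hub:

```python
import torch
import torch.nn as nn

# Text queries attend to image keys/values, yielding visually guided text
# features; swapping the query and key/value roles gives the reverse.
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
text_feat = torch.randn(2, 16, 512)     # (batch, text tokens, dim)
img_feat = torch.randn(2, 49, 512)      # (batch, image patches, dim)
guided_text, _ = attn(query=text_feat, key=img_feat, value=img_feat)
guided_vision, _ = attn(query=img_feat, key=text_feat, value=text_feat)
```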
arXiv Detail & Related papers (2023-12-11T13:11:04Z)
- NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
We introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
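A minimal sketch of that wiring, with all dimensions and module choices assumed for illustration: small trainable adaptors project frozen encoder features into the LLM's token space, and the LLM's output "signal" embeddings are projected into the conditioning space of a diffusion decoder.

```python
import torch
import torch.nn as nn

class InputAdaptor(nn.Module):
    """Projects frozen-encoder features into the LLM's embedding space."""
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_feats):            # (batch, tokens, enc_dim)
        return self.proj(enc_feats)          # tokens the LLM can consume

class OutputAdaptor(nn.Module):
    """Projects LLM hidden states for generation slots into a decoder's
    conditioning space (e.g., an image/video/audio diffusion model)."""
    def __init__(self, llm_dim=4096, cond_dim=768):
        super().__init__()
        self.proj = nn.Linear(llm_dim, cond_dim)

    def forward(self, signal_tokens):        # (batch, slots, llm_dim)
        return self.proj(signal_tokens)

img_tokens = InputAdaptor()(torch.randn(1, 49, 1024))   # vision -> LLM space
decoder_cond = OutputAdaptor()(torch.randn(1, 4, 4096)) # LLM -> diffusion cond
```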
arXiv Detail & Related papers (2023-09-11T15:02:25Z)
- Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework.
We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery.
Finally, we apply the Generative adversarial network-based channel Estimation (CGE) for estimating the wireless channel state information.
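A bare-bones, hypothetical sketch of GAN-based channel estimation (layer sizes and the flattened pilot/CSI shapes are assumptions, not the paper's CGE design): a generator maps received pilots to a CSI estimate while a discriminator learns to distinguish estimated from measured channels.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 128))  # pilots -> CSI
D = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))   # CSI -> real/fake

pilots = torch.randn(32, 64)       # received pilot observations (flattened)
h_true = torch.randn(32, 128)      # measured channel state information

h_est = G(pilots)
bce = nn.BCEWithLogitsLoss()
d_loss = (bce(D(h_true), torch.ones(32, 1))
          + bce(D(h_est.detach()), torch.zeros(32, 1)))  # train D to discriminate
g_loss = bce(D(h_est), torch.ones(32, 1))                # train G to fool D
# In practice g_loss is typically combined with an MSE term against h_true.
```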
arXiv Detail & Related papers (2023-09-03T19:24:34Z)
- Large Generative AI Models for Telecom: The Next Big Thing? [7.36678071967351]
Large GenAI models are envisioned to open up a new era of autonomous wireless networks.
In this article, we aim to unfold the opportunities that can be reaped from integrating large GenAI models into the Telecom domain.
arXiv Detail & Related papers (2023-06-17T03:45:00Z)