Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges
- URL: http://arxiv.org/abs/2510.20637v1
- Date: Thu, 23 Oct 2025 15:08:58 GMT
- Title: Large Multimodal Models-Empowered Task-Oriented Autonomous Communications: Design Methodology and Implementation Challenges
- Authors: Hyun Jong Yang, Hyunsoo Kim, Hyeonho Noh, Seungnyun Kim, Byonghyo Shim
- Abstract summary: Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthroughs. This article focuses on task-oriented autonomous communications with LLMs/LMMs. We show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques.
- Score: 31.57528074626831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) and large multimodal models (LMMs) have achieved unprecedented breakthroughs, showcasing remarkable capabilities in natural language understanding, generation, and complex reasoning. This transformative potential has positioned them as key enablers for 6G autonomous communications among machines, vehicles, and humanoids. In this article, we provide an overview of task-oriented autonomous communications with LLMs/LMMs, focusing on multimodal sensing integration, adaptive reconfiguration, and prompt/fine-tuning strategies for wireless tasks. We demonstrate the framework through three case studies: LMM-based traffic control, LLM-based robot scheduling, and LMM-based environment-aware channel estimation. From experimental results, we show that the proposed LLM/LMM-aided autonomous systems significantly outperform conventional and discriminative deep learning (DL) model-based techniques, maintaining robustness under dynamic objectives, varying input parameters, and heterogeneous multimodal conditions where conventional static optimization degrades.
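To make the described workflow concrete, below is a minimal, illustrative Python sketch of the kind of task-oriented loop the abstract outlines: multimodal sensing data is serialized into a prompt, an LLM/LMM returns a structured wireless-control decision, and the network reconfigures accordingly. All names here (e.g., `query_lmm`, `beam_index`) are hypothetical placeholders, not APIs or components from the paper.

```python
# Illustrative sketch only (not from the paper): an LMM-aided, task-oriented
# control loop that turns multimodal observations into a wireless decision.
import json

def build_prompt(task: str, sensing: dict) -> str:
    """Serialize the wireless task and multimodal observations into one prompt."""
    return (
        f"Task: {task}\n"
        f"Observations (JSON): {json.dumps(sensing)}\n"
        'Respond with a JSON object {"action": ..., "rationale": ...}.'
    )

def query_lmm(prompt: str) -> str:
    """Hypothetical stand-in for a deployed LLM/LMM inference endpoint."""
    # A real system would send `prompt` to a model; here we return a canned reply.
    return ('{"action": {"beam_index": 12, "tx_power_dBm": 20},'
            ' "rationale": "LoS blocked by truck; steer toward reflector."}')

sensing = {
    "camera_summary": "pedestrian crossing; truck blocking line of sight",
    "position": [37.566, 126.978],
    "csi_snr_dB": 8.2,
}
decision = json.loads(query_lmm(build_prompt("select beam and tx power", sensing)))
print(decision["action"])  # e.g. {'beam_index': 12, 'tx_power_dBm': 20}
```

The key design point suggested by the abstract is that the same loop can absorb new objectives or input modalities by changing the prompt rather than retraining a task-specific discriminative model.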
Related papers
- Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - Do LLM Modules Generalize? A Study on Motion Generation for Autonomous Driving [15.903491909277745]
We present a comprehensive evaluation of five key LLM modules. We demonstrate that, when appropriately adapted, these modules can significantly improve performance for autonomous driving motion generation. In addition, we identify which techniques can be effectively transferred, analyze the potential reasons for the failure of others, and discuss the specific adaptations needed for autonomous driving scenarios.
arXiv Detail & Related papers (2025-09-02T19:02:49Z) - Large Language Models for Wireless Communications: From Adaptation to Autonomy [47.40285060307752]
Large language models (LLMs) offer unprecedented capabilities in reasoning, generalization, and zero-shot learning. This article explores the role of LLMs in transforming wireless systems across three key directions.
arXiv Detail & Related papers (2025-07-29T06:21:10Z) - Modular Machine Learning: An Indispensable Path towards New-Generation Large Language Models [33.822930522694406]
We provide an overview of a promising learning paradigm, i.e., Modular Machine Learning (MML), as an essential approach toward new-generation large language models (LLMs). We propose a unified MML framework for LLMs, which decomposes the complex structure of LLMs into three interdependent components: modular representation, modular model, and modular reasoning. Ultimately, we believe the integration of MML with LLMs has the potential to bridge the gap between statistical (deep) learning and formal (logical) reasoning.
arXiv Detail & Related papers (2025-04-28T17:42:02Z) - HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding [67.24430397016275]
We propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.
arXiv Detail & Related papers (2025-03-12T06:01:05Z) - MALMM: Multi-Agent Large Language Models for Zero-Shot Robotics Manipulation [62.854649499866774]
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotics manipulation and navigation. We propose a novel multi-agent LLM framework that distributes high-level planning and low-level control code generation across specialized LLM agents. We evaluate our approach on nine RLBench tasks, including long-horizon tasks, and demonstrate its ability to solve robotics manipulation in a zero-shot setting.
arXiv Detail & Related papers (2024-11-26T17:53:44Z) - Large Multi-Modal Models (LMMs) as Universal Foundation Models for AI-Native Wireless Systems [57.41621687431203]
Large language models (LLMs) and foundation models have recently been touted as game-changers for 6G systems.
This paper presents a comprehensive vision on how to design universal foundation models tailored towards the deployment of artificial intelligence (AI)-native networks.
arXiv Detail & Related papers (2024-01-30T00:21:41Z) - LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving [84.31119464141631]
This work employs Large Language Models (LLMs) as a decision-making component for complex autonomous driving scenarios. Extensive experiments demonstrate that our proposed method not only consistently surpasses baseline approaches in single-vehicle tasks, but also helps handle complex driving behaviors, even in multi-vehicle coordination.
arXiv Detail & Related papers (2023-10-04T17:59:49Z) - Large AI Model Empowered Multimodal Semantic Communications [48.73159237649128]
We propose a Large AI Model-based Multimodal SC (LAMMSC) framework.
We first present the Conditional-based Multimodal Alignment (MMA) that enables the transformation between multimodal and unimodal data.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows users to perform personalized semantic extraction or recovery.
Finally, we apply the Generative adversarial network-based channel Estimation (CGE) to estimate the wireless channel state information (a conceptual sketch of these three stages follows this list).
arXiv Detail & Related papers (2023-09-03T19:24:34Z)
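As a rough, conceptual illustration of the staged pipeline described in the LAMMSC entry above, the following Python sketch wires the three named components (MMA alignment, LKB-guided semantic extraction, CGE channel estimation) into a single flow. Every function body here is an arbitrary placeholder assumed for illustration; in the actual work each stage is a learned model, not a stub.

```python
# Conceptual sketch only: the three LAMMSC stages modeled as placeholder
# functions. All implementations below are hypothetical stand-ins.

def multimodal_alignment(inputs: dict) -> str:
    """MMA stage (placeholder): map multimodal data to a unimodal representation."""
    return " | ".join(f"{k}: {v}" for k, v in inputs.items())

def semantic_extraction(unimodal: str, profile: dict) -> str:
    """LKB stage (placeholder): personalized semantic extraction per user profile."""
    return f"[{profile['interest']}] {unimodal}"

def estimate_channel(pilot_snr_dB: float) -> float:
    """CGE stage (placeholder): arbitrary stand-in for GAN-based CSI estimation."""
    return max(0.0, 1.0 - 10 ** (-pilot_snr_dB / 10))

semantics = semantic_extraction(
    multimodal_alignment({"image": "street scene", "speech": "turn left ahead"}),
    {"interest": "navigation"},
)
print(semantics)
print(f"estimated channel quality: {estimate_channel(12.0):.3f}")
```

The staging mirrors the entry's ordering: alignment first collapses heterogeneous modalities, the knowledge base then personalizes what is extracted or recovered, and channel estimation conditions the transmission itself.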