Large AI Model Empowered Multimodal Semantic Communications
- URL: http://arxiv.org/abs/2309.01249v1
- Date: Sun, 3 Sep 2023 19:24:34 GMT
- Title: Large AI Model Empowered Multimodal Semantic Communications
- Authors: Feibo Jiang, Yubo Peng, Li Dong, Kezhi Wang, Kun Yang, Cunhua Pan,
Xiaohu You
- Abstract summary: We propose a Large AI Model-based Multimodal SC (LAM-MSC) framework.
We first present the SC-based Multimodal Alignment (MMA)
Then, a personalized LLM-based Knowledge Base (LKB) is proposed.
Finally, we apply the Conditional Generative adversarial networks-based channel Estimation (CGE) to obtain Channel State Information (CSI)
- Score: 51.17527319441436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal signals, including text, audio, image and video, can be integrated
into Semantic Communication (SC) for providing an immersive experience with low
latency and high quality at the semantic level. However, the multimodal SC has
several challenges, including data heterogeneity, semantic ambiguity, and
signal fading. Recent advancements in large AI models, particularly in
Multimodal Language Model (MLM) and Large Language Model (LLM), offer potential
solutions for these issues. To this end, we propose a Large AI Model-based
Multimodal SC (LAM-MSC) framework, in which we first present the MLM-based
Multimodal Alignment (MMA) that utilizes the MLM to enable the transformation
between multimodal and unimodal data while preserving semantic consistency.
Then, a personalized LLM-based Knowledge Base (LKB) is proposed, which allows
users to perform personalized semantic extraction or recovery through the LLM.
This effectively addresses the semantic ambiguity. Finally, we apply the
Conditional Generative adversarial networks-based channel Estimation (CGE) to
obtain Channel State Information (CSI). This approach effectively mitigates the
impact of fading channels in SC. Finally, we conduct simulations that
demonstrate the superior performance of the LAM-MSC framework.
Related papers
- UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entities Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - Model Composition for Multimodal Large Language Models [73.70317850267149]
We propose a new paradigm through the model composition of existing MLLMs to create a new model that retains the modal understanding capabilities of each original model.
Our basic implementation, NaiveMC, demonstrates the effectiveness of this paradigm by reusing modality encoders and merging LLM parameters.
arXiv Detail & Related papers (2024-02-20T06:38:10Z) - Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short to comprehend context involving multiple images.
We propose a two phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data.
We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF)
In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z) - MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples [30.284100018891397]
Multi-Modal In-Context Tuning (MMICT) is a novel multi-modal fine-tuning paradigm that boosts multi-modal fine-tuning.
We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives.
Based on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently generate outputs conditioned on the textual-guided visual features.
arXiv Detail & Related papers (2023-12-11T13:11:04Z) - AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [33.072967177313025]
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals.
AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B)
We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
arXiv Detail & Related papers (2023-09-27T22:50:51Z) - NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
We introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
arXiv Detail & Related papers (2023-09-11T15:02:25Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and
Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset in terms of multi-turn dialogue, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.