Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2412.06208v1
- Date: Mon, 09 Dec 2024 04:58:49 GMT
- Title: Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
- Authors: Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan
- Abstract summary: Multimodal semantic communication significantly enhances communication efficiency and reliability.
It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes.
This paper proposes a pilot-guided framework for multimodal semantic communication specifically tailored for audio-visual event localization tasks.
- Score: 4.680740822211451
- License:
- Abstract: Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research primarily relies on analog channels and assumes constant channel states (perfect CSI), which is inadequate for addressing dynamic physical channels and noise in real-world scenarios. Existing methods often focus on single modality tasks and fail to handle multimodal stream data, such as video and audio, and their corresponding tasks. Furthermore, current semantic encoding and decoding modules mainly transmit single modality features, neglecting the need for multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided framework for multimodal semantic communication specifically tailored for audio-visual event localization tasks. This framework utilizes digital pilot codes and channel modules to guide the state of analog channels in real-world scenarios and designs Euler-based multimodal semantic encoding and decoding that consider time-frequency characteristics based on dynamic channel state. This approach effectively handles multimodal stream source data, especially for audio-visual event localization tasks. Extensive numerical experiments demonstrate the robustness of the proposed framework in channel changes and its support for various communication scenarios. The experimental results show that the framework outperforms existing benchmark methods in terms of Signal-to-Noise Ratio (SNR), highlighting its advantage in semantic communication quality.
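The abstract describes two concrete mechanisms: digital pilot codes that let the receiver infer the current state of the analog channel, and an Euler-based (amplitude/phase) representation of the multimodal semantic features conditioned on that state. The snippet below is a minimal sketch of those two ideas only, not the authors' implementation: the flat fading channel, the least-squares pilot estimator, the one-tap equalizer, and every function name here (awgn_fading_channel, estimate_channel, euler_encode, euler_decode) are illustrative assumptions; the paper's actual semantic encoding/decoding modules and channel handling are more elaborate and are not reproduced.

```python
# Minimal, illustrative sketch (assumptions throughout; not the paper's code):
# 1) known pilot symbols are sent so the receiver can estimate the channel state;
# 2) semantic features are carried in Euler form, A * exp(j * theta).
import numpy as np

rng = np.random.default_rng(0)

def awgn_fading_channel(x: np.ndarray, h: complex, snr_db: float) -> np.ndarray:
    """Assumed flat-fading model: y = h * x + n, with complex Gaussian noise."""
    signal_power = np.mean(np.abs(x) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power / 2) * (
        rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    )
    return h * x + noise

def estimate_channel(pilot_tx: np.ndarray, pilot_rx: np.ndarray) -> complex:
    """Least-squares channel estimate from the known pilot sequence."""
    return np.vdot(pilot_tx, pilot_rx) / np.vdot(pilot_tx, pilot_tx)

def euler_encode(features: np.ndarray) -> np.ndarray:
    """Map real-valued features to complex symbols A * exp(j * theta)."""
    half = len(features) // 2
    amplitude = np.abs(features[:half]) + 1e-6          # magnitude branch
    phase = np.pi * np.tanh(features[half:2 * half])    # bounded phase branch
    return amplitude * np.exp(1j * phase)

def euler_decode(symbols: np.ndarray) -> np.ndarray:
    """Recover amplitude and phase branches from equalized symbols."""
    return np.concatenate([np.abs(symbols), np.angle(symbols)])

# Toy end-to-end run: pilots first, then Euler-encoded features.
features = rng.standard_normal(256)                  # stand-in for fused audio-visual features
pilots = np.exp(1j * 2 * np.pi * rng.random(16))     # known unit-modulus pilot symbols
h_true = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
snr_db = 10.0

pilot_rx = awgn_fading_channel(pilots, h_true, snr_db)
h_hat = estimate_channel(pilots, pilot_rx)           # channel state inferred from pilots

tx = euler_encode(features)
rx = awgn_fading_channel(tx, h_true, snr_db)
equalized = rx / h_hat                               # one-tap equalization with the pilot estimate
recovered = euler_decode(equalized)

print(f"channel estimate error: {abs(h_true - h_hat):.4f}")
```

Under this toy model, the pilot estimate h_hat stands in for the pilot-guided channel state that the framework uses to adapt its semantic codec; the Euler step only illustrates how real-valued features can be carried as the amplitude and phase of complex symbols.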
Related papers
- Take What You Need: Flexible Multi-Task Semantic Communications with Channel Adaptation [51.53221300103261]
This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture.
A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions.
Experimental results demonstrate the superior performance of our framework compared to conventional methods in tasks such as image reconstruction and object detection.
arXiv Detail & Related papers (2025-02-12T09:01:25Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Dense Audio-Visual Event localization (DAVE).
LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties.
We further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to capture local temporal patterns of audio-visual events.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Communication-Efficient Framework for Distributed Image Semantic Wireless Transmission [68.69108124451263]
A federated learning-based semantic communication (FLSC) framework is proposed for multi-task distributed image transmission with IoT devices.
Each link is composed of a hierarchical vision transformer (HVT)-based extractor and a task-adaptive translator.
A channel state information-based multiple-input multiple-output transmission module is designed to combat channel fading and noise.
arXiv Detail & Related papers (2023-08-07T16:32:14Z)
- Rate-Adaptive Coding Mechanism for Semantic Communications With Multi-Modal Data [23.597759255020296]
We propose a distributed multi-modal semantic communication framework incorporating the conventional channel encoder/decoder.
We establish a general rate-adaptive coding mechanism for various types of multi-modal semantic tasks.
Numerical results show that the proposed mechanism fares better than both conventional communication and existing semantic communication systems.
arXiv Detail & Related papers (2023-05-18T07:31:37Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- One-to-Many Semantic Communication Systems: Design, Implementation, Performance Evaluation [35.21413988605204]
We propose a one-to-many semantic communication system called MR_DeepSC.
By leveraging the semantic features of different users, a semantic recognizer is built to distinguish among them.
The proposed MR_DeepSC achieves the best performance in terms of BLEU score.
arXiv Detail & Related papers (2022-09-20T02:48:34Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z)