Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2412.06208v1
- Date: Mon, 09 Dec 2024 04:58:49 GMT
- Title: Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization
- Authors: Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan
- Abstract summary: Multimodal semantic communication significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. This paper proposes a pilot-guided framework for multimodal semantic communication specifically tailored for audio-visual event localization tasks.
- Score: 4.680740822211451
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal semantic communication, which integrates data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability, with broad application prospects in fields such as artificial intelligence, autonomous driving, and smart homes. However, current research primarily relies on analog channels and assumes constant channel states (perfect CSI), which is inadequate for the dynamic physical channels and noise of real-world scenarios. Existing methods also tend to focus on single-modality tasks and cannot handle multimodal stream data, such as video and audio, or their corresponding tasks. Furthermore, current semantic encoding and decoding modules mainly transmit single-modality features, neglecting the needs of multimodal semantic enhancement and recognition tasks. To address these challenges, this paper proposes a pilot-guided framework for multimodal semantic communication tailored to audio-visual event localization. The framework uses digital pilot codes and channel modules to track the state of analog channels in real-world scenarios, and designs Euler-based multimodal semantic encoding and decoding that account for time-frequency characteristics under dynamic channel states. This approach effectively handles multimodal stream source data, especially for audio-visual event localization tasks. Extensive numerical experiments demonstrate the robustness of the proposed framework under channel changes and its support for various communication scenarios. The results show that the framework outperforms existing benchmark methods in terms of Signal-to-Noise Ratio (SNR), highlighting its advantage in semantic communication quality.
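A minimal illustration of the pilot-guided idea (not the authors' code): a known pilot sequence is used to estimate a flat-fading channel and its noise level, and real-valued semantic features are carried in Euler (amplitude/phase) form. The channel model, least-squares estimator, and sign-in-phase feature mapping below are simplifying assumptions made for this sketch, using only NumPy.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): a known pilot sequence
# is sent through a flat-fading channel so the receiver can estimate the channel
# gain and noise level, which then condition how semantic features are handled.

rng = np.random.default_rng(0)

def transmit(symbols, h, snr_db):
    """Pass complex symbols through a flat-fading channel h with AWGN."""
    signal_power = np.mean(np.abs(symbols) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal(symbols.shape)
                                        + 1j * rng.standard_normal(symbols.shape))
    return h * symbols + noise

# Known pilot sequence (QPSK symbols) shared by transmitter and receiver.
pilots = np.exp(1j * np.pi / 4 * (2 * rng.integers(0, 4, 64) + 1))

# True (unknown to the receiver) channel gain and SNR.
h_true, snr_db = 0.8 * np.exp(1j * 0.3), 10.0
rx_pilots = transmit(pilots, h_true, snr_db)

# Least-squares channel estimate from the pilots.
h_hat = np.vdot(pilots, rx_pilots) / np.vdot(pilots, pilots)

# Residual after removing the estimated channel gives a noise-power estimate,
# hence an estimated SNR that can guide the semantic encoder/decoder.
noise_est = np.mean(np.abs(rx_pilots - h_hat * pilots) ** 2)
snr_est_db = 10 * np.log10(np.abs(h_hat) ** 2 * np.mean(np.abs(pilots) ** 2) / noise_est)

# Euler-style representation: a real semantic feature vector is mapped to
# amplitude and phase, transmitted as complex symbols, and equalized with h_hat.
features = rng.standard_normal(128)
amplitude = np.abs(features)
phase = np.where(features >= 0, 0.0, np.pi)   # sign carried in the phase
tx_symbols = amplitude * np.exp(1j * phase)   # Euler form: A * exp(j * phi)
rx_symbols = transmit(tx_symbols, h_true, snr_db) / h_hat
recovered = np.real(rx_symbols)

print(f"estimated SNR: {snr_est_db:.1f} dB")
print(f"feature MSE after equalization: {np.mean((features - recovered) ** 2):.4f}")
```

In the proposed framework, the pilot-derived channel state would condition the Euler-based multimodal semantic encoder and decoder rather than a simple equalizer as in this toy example.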
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection.
Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - SIMAC: A Semantic-Driven Integrated Multimodal Sensing And Communication Framework [22.924064428134507]
Single-modality sensing faces limitations in accuracy and capability, and its decoupled implementation with communication systems increases latency.
We propose a semantic-driven integrated multimodal sensing and communication framework to overcome these challenges.
arXiv Detail & Related papers (2025-03-11T01:04:42Z) - Take What You Need: Flexible Multi-Task Semantic Communications with Channel Adaptation [51.53221300103261]
This article introduces a novel channel-adaptive and multi-task-aware semantic communication framework based on a masked auto-encoder architecture.
A channel-aware extractor is employed to dynamically select relevant information in response to real-time channel conditions.
Experimental results demonstrate the superior performance of our framework compared to conventional methods in tasks such as image reconstruction and object detection.
arXiv Detail & Related papers (2025-02-12T09:01:25Z) - Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z) - Communication-Efficient Framework for Distributed Image Semantic Wireless Transmission [68.69108124451263]
A federated learning-based semantic communication (FLSC) framework is proposed for multi-task distributed image transmission with IoT devices.
Each link is composed of a hierarchical vision transformer (HVT)-based extractor and a task-adaptive translator.
A channel state information-based multiple-input multiple-output transmission module is designed to combat channel fading and noise.
arXiv Detail & Related papers (2023-08-07T16:32:14Z) - Rate-Adaptive Coding Mechanism for Semantic Communications With Multi-Modal Data [23.597759255020296]
We propose a distributed multi-modal semantic communication framework incorporating the conventional channel encoder/decoder.
We establish a general rate-adaptive coding mechanism for various types of multi-modal semantic tasks.
Numerical results show that the proposed mechanism fares better than both conventional communication and existing semantic communication systems.
arXiv Detail & Related papers (2023-05-18T07:31:37Z) - One-to-Many Semantic Communication Systems: Design, Implementation, Performance Evaluation [35.21413988605204]
We propose a one-to-many semantic communication system called MR_DeepSC.
By leveraging the semantic features of different users, a semantic recognizer is built to distinguish among them.
The proposed MR_DeepSC achieves the best performance in terms of BLEU score.
arXiv Detail & Related papers (2022-09-20T02:48:34Z) - Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z)