MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2104.02971v1
- Date: Wed, 7 Apr 2021 07:44:22 GMT
- Title: MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- Authors: Jiashuo Yu, Ying Cheng, Rui Feng
- Abstract summary: We propose a Multimodal Parallel Network (MPN), which perceives global semantics and unmixed local information in parallel.
Our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event dataset.
- Score: 4.856609995251114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual event localization aims to localize an event that is both
audible and visible in the wild, which is a widespread audio-visual scene
analysis task for unconstrained videos. To address this task, we propose a
Multimodal Parallel Network (MPN), which can perceive global semantics and
unmixed local information in parallel. Specifically, our MPN framework consists
of a classification subnetwork to predict event categories and a localization
subnetwork to predict event boundaries. The classification subnetwork is
constructed by the Multimodal Co-attention Module (MCM) and obtains global
contexts. The localization subnetwork consists of the Multimodal Bottleneck
Attention Module (MBAM), which is designed to extract fine-grained
segment-level content. Extensive experiments demonstrate that our framework
achieves state-of-the-art performance in both fully supervised and weakly
supervised settings on the Audio-Visual Event (AVE) dataset.
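The abstract names the two subnetworks and their attention modules but gives no implementation detail. The PyTorch-style sketch below is one plausible reading of that two-branch layout, not the authors' code: the module internals (multi-head co-attention as a stand-in for the MCM, shared bottleneck tokens as a stand-in for the MBAM), the feature dimension, head counts, and pooling choices are all assumptions; only the overall structure (a classification branch for event categories and a localization branch for segment-level boundaries) comes from the abstract.

```python
# Minimal sketch (not the authors' implementation) of the two-branch MPN layout,
# assuming per-segment audio/visual features from pretrained backbones.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Cross-modal co-attention: each modality attends to the other (MCM stand-in)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):               # (B, T, D) each
        a, _ = self.a2v(audio, visual, visual)      # audio queries attend to visual keys/values
        v, _ = self.v2a(visual, audio, audio)       # visual queries attend to audio keys/values
        return a, v

class BottleneckAttention(nn.Module):
    """Fuses modalities through a small set of shared bottleneck tokens (MBAM stand-in)."""
    def __init__(self, dim, n_tokens=4, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, n_tokens, dim))
        self.collect = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):               # (B, T, D) each
        x = torch.cat([audio, visual], dim=1)       # (B, 2T, D)
        b = self.tokens.expand(x.size(0), -1, -1)
        b, _ = self.collect(b, x, x)                # bottleneck tokens gather cross-modal context
        out, _ = self.distribute(x, b, b)           # segments read back only through the bottleneck
        a, v = out.chunk(2, dim=1)
        return a, v

class MPNSketch(nn.Module):
    def __init__(self, dim=256, n_classes=28):      # 28 event categories in AVE
        super().__init__()
        self.cls_branch = CoAttention(dim)
        self.loc_branch = BottleneckAttention(dim)
        self.cls_head = nn.Linear(2 * dim, n_classes)   # video-level event category
        self.loc_head = nn.Linear(2 * dim, 1)           # per-segment event/background score

    def forward(self, audio, visual):                   # (B, T, D) segment features
        a_g, v_g = self.cls_branch(audio, visual)
        cls_logits = self.cls_head(torch.cat([a_g.mean(1), v_g.mean(1)], dim=-1))
        a_l, v_l = self.loc_branch(audio, visual)
        loc_logits = self.loc_head(torch.cat([a_l, v_l], dim=-1)).squeeze(-1)    # (B, T)
        return cls_logits, loc_logits
```

Under this reading, the fully supervised AVE setting would train the classification logits against the video-level category and the localization logits against segment-level labels, while the weakly supervised setting would use only the video-level label.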
Related papers
- CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization [15.861700882671418]
This paper explores DAVEL under a new and more challenging weakly-supervised setting (the W-DAVEL task). We exploit cross-modal salient anchors, defined as reliable timestamps that are well predicted under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets.
arXiv Detail & Related papers (2025-08-06T15:49:53Z)
- Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration [48.57159286673662]
This paper aims to advance audio-visual scene understanding for longer, untrimmed videos.
We introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration and the Multi-Temporal Granularity Collaboration.
Experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
arXiv Detail & Related papers (2024-12-17T07:43:36Z)
- Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing [22.655045848201528]
Capturing accurate event semantics for each audio/visual segment is vital.
Each segment may contain multiple events, resulting in semantically mixed holistic features.
We propose a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations.
arXiv Detail & Related papers (2024-12-15T16:54:53Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z)
- Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
- Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural Network [55.21501819988941]
Existing network traffic classification methods based on convolutional neural networks (CNNs) often emphasize local patterns of traffic data while ignoring global information associations.
We propose an end-to-end network traffic classification method.
arXiv Detail & Related papers (2022-10-30T01:52:05Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with optimal transport distance to generate representative visual and textual summaries (a generic sketch of such an objective follows this list).
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
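The MHMS entry above only names a cross-domain alignment objective based on optimal transport distance. As a reference point, the sketch below shows one generic way such an objective could be computed between video-segment and text-segment embeddings, using entropic regularization (Sinkhorn iterations). The cosine cost, uniform marginals, regularization strength, and iteration count are assumptions and are not taken from the MHMS paper.

```python
# Generic sketch of an optimal-transport alignment cost between video- and
# text-segment embeddings; one possible reading of a "cross-domain alignment
# objective with optimal transport distance", not the MHMS recipe.
import torch
import torch.nn.functional as F

def sinkhorn_ot_distance(video_feats, text_feats, eps=0.05, n_iters=50):
    """video_feats: (N, D), text_feats: (M, D); returns a scalar OT cost."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    cost = 1.0 - v @ t.T                            # (N, M) cosine-distance cost matrix
    K = torch.exp(-cost / eps)                      # Gibbs kernel for entropic regularization
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)  # uniform marginals
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u = torch.ones_like(a)
    for _ in range(n_iters):                        # Sinkhorn fixed-point iterations
        v_scale = b / (K.T @ u)
        u = a / (K @ v_scale)
    plan = torch.diag(u) @ K @ torch.diag(v_scale)  # approximate transport plan
    return (plan * cost).sum()                      # expected alignment cost under the plan
```

The returned cost is differentiable with respect to both feature sets, so it could be minimized directly as an alignment loss alongside task-specific objectives.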