MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2104.02971v1
- Date: Wed, 7 Apr 2021 07:44:22 GMT
- Title: MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- Authors: Jiashuo Yu, Ying Cheng, Rui Feng
- Abstract summary: We propose a Multimodal Parallel Network (MPN), which can perceive global semantics and unmixed local information in parallel.
Our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event dataset.
- Score: 4.856609995251114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual event localization aims to localize an event that is both
audible and visible in the wild, which is a widespread audio-visual scene
analysis task for unconstrained videos. To address this task, we propose a
Multimodal Parallel Network (MPN), which can perceive global semantics and
unmixed local information in parallel. Specifically, our MPN framework consists
of a classification subnetwork to predict event categories and a localization
subnetwork to predict event boundaries. The classification subnetwork is
constructed by the Multimodal Co-attention Module (MCM) and obtains global
contexts. The localization subnetwork consists of the Multimodal Bottleneck
Attention Module (MBAM), which is designed to extract fine-grained
segment-level contents. Extensive experiments demonstrate that our framework
achieves state-of-the-art performance in both fully supervised and weakly
supervised settings on the Audio-Visual Event (AVE) dataset.
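The listing ships no reference code, so the following is a minimal PyTorch sketch of the co-attention idea behind the MCM: each modality attends to the other, and the attended features are pooled into a global context for event classification. All names and sizes (CoAttention, d_model, the fusion scheme) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Minimal cross-modal co-attention: audio attends to video and
    vice versa. An illustrative sketch, not the authors' MCM code."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, video):
        # audio, video: (batch, time, d_model) segment-level features
        a_att, _ = self.a2v(query=audio, key=video, value=video)
        v_att, _ = self.v2a(query=video, key=audio, value=audio)
        # fuse attended features into a single global representation
        fused = torch.cat([audio + a_att, video + v_att], dim=-1)
        return fused.mean(dim=1)  # (batch, 2 * d_model) global context

coatt = CoAttention()
a = torch.randn(2, 10, 256)  # e.g. 10 one-second audio segments
v = torch.randn(2, 10, 256)
print(coatt(a, v).shape)     # torch.Size([2, 512])
```

A bottleneck variant in the spirit of the MBAM would presumably route this cross-modal interaction through a small set of latent tokens and skip the global pooling, so that fine-grained segment-level detail is preserved for boundary prediction.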
Related papers
- Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
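As a concrete reference point for what "global alignment" means here, a common realization is a symmetric InfoNCE loss over matched image-text pairs; the sketch below is a generic version of that idea, not the GLSCL objective itself.

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs: a common way
    to realize global cross-modal alignment. Schematic only."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(img.size(0))    # diagonal pairs are matches
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

print(global_alignment_loss(torch.randn(4, 64), torch.randn(4, 64)))
```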
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural Network [55.21501819988941]
Existing network traffic classification based on convolutional neural networks (CNNs) often emphasizes local patterns of traffic data while ignoring global information associations.
We propose an end-to-end network traffic classification method.
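The MLP-Mixer backbone named in the title alternates token-mixing and channel-mixing MLPs, which is how it recovers the global associations that CNNs miss. Below is a standard Mixer block in PyTorch, with dimensions chosen arbitrarily for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One standard MLP-Mixer block: a token-mixing MLP across the
    sequence, then a channel-mixing MLP per token. A generic sketch of
    the backbone, not the paper's exact configuration."""
    def __init__(self, n_tokens, dim, token_hidden=64, chan_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)      # mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))   # mix across channels
        return x

block = MixerBlock(n_tokens=16, dim=128)
print(block(torch.randn(4, 16, 128)).shape)  # torch.Size([4, 16, 128])
```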
arXiv Detail & Related papers (2022-10-30T01:52:05Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework that bridges the visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with an optimal transport distance to generate representative visual and textual summaries.
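The optimal transport distance mentioned here is typically approximated with entropy regularization and Sinkhorn iterations. The following is a minimal sketch of that computation under uniform marginals, with hyperparameters (eps, n_iters) chosen for illustration rather than taken from MHMS.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport between two feature sets
    x: (n, d) and y: (m, d) with uniform marginals. Illustrative only."""
    cost = torch.cdist(x, y)
    cost = cost / cost.max()                    # rescale for numerical stability
    k = torch.exp(-cost / eps)                  # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        v = b / (k.t() @ u)
        u = a / (k @ v)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)  # transport plan (n, m)
    return (plan * cost).sum()

video_feats = torch.randn(8, 64)  # e.g. 8 video segment embeddings
text_feats = torch.randn(5, 64)   # e.g. 5 sentence embeddings
print(sinkhorn_distance(video_feats, text_feats))
```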
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn cross-modal correlations and leverage them as semantic guidance.
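Modulation in this sense usually means one modality predicting transformation parameters for the other's features. The FiLM-style sketch below illustrates that mechanism generically; it is an assumption for exposition, not M2N's published design.

```python
import torch
import torch.nn as nn

class CrossModalModulation(nn.Module):
    """FiLM-style modulation: one modality predicts a scale and shift
    applied to the other's features. A generic stand-in for the
    modulation idea, not M2N's architecture."""
    def __init__(self, dim=128):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, target, guide):
        # target, guide: (batch, time, dim) aligned segment features
        scale = torch.sigmoid(self.to_scale(guide))
        return target * scale + self.to_shift(guide)

mod = CrossModalModulation()
a = torch.randn(2, 10, 128)  # audio segment features
v = torch.randn(2, 10, 128)  # visual segment features
print(mod(a, v).shape)       # torch.Size([2, 10, 128])
```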
arXiv Detail & Related papers (2021-08-26T13:11:48Z)
- Multi-level Attention Fusion Network for Audio-visual Event Recognition [6.767885381740951]
Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition.
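A toy version of such dynamic fusion scores every (time step, modality) pair and takes an attention-weighted sum; the sketch below conveys the mechanism schematically and should not be read as the MAFnet architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Toy dynamic fusion: score every (time step, modality) pair and
    take an attention-weighted sum. A schematic sketch of the idea,
    not the published MAFnet."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, audio, video):
        # audio, video: (batch, time, dim)
        x = torch.stack([audio, video], dim=2)              # (B, T, 2, dim)
        w = torch.softmax(self.score(x).flatten(1), dim=1)  # over all T*2 slots
        w = w.view_as(x[..., 0]).unsqueeze(-1)              # (B, T, 2, 1)
        return (w * x).sum(dim=(1, 2))                      # (B, dim) clip embedding

fuse = AttentionFusion()
print(fuse(torch.randn(2, 10, 128), torch.randn(2, 10, 128)).shape)
```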
arXiv Detail & Related papers (2021-06-12T10:24:52Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Dynamic Context-guided Capsule Network for Multimodal Machine Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
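DCCN builds on capsule networks, whose core mechanism is dynamic routing between capsule layers. For reference, here is standard routing-by-agreement (Sabour et al., 2017), the base mechanism that DCCN extends with context guidance, not the DCCN-specific variant.

```python
import torch

def dynamic_routing(u_hat, n_iters=3):
    """Standard capsule routing-by-agreement (Sabour et al., 2017).
    u_hat: (batch, n_in, n_out, dim) prediction vectors. Illustrative."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(n_iters):
        c = torch.softmax(b, dim=2)                   # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum -> (B, n_out, dim)
        norm = s.norm(dim=-1, keepdim=True)
        v = (norm ** 2 / (1 + norm ** 2)) * s / (norm + 1e-8)  # squash
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement update
    return v

u_hat = torch.randn(2, 6, 4, 16)  # 6 input capsules -> 4 output capsules
print(dynamic_routing(u_hat).shape)  # torch.Size([2, 4, 16])
```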
arXiv Detail & Related papers (2020-09-04T06:18:24Z)