MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2104.02971v1
- Date: Wed, 7 Apr 2021 07:44:22 GMT
- Title: MPN: Multimodal Parallel Network for Audio-Visual Event Localization
- Authors: Jiashuo Yu, Ying Cheng, Rui Feng
- Abstract summary: We propose a Multimodal Parallel Network (MPN), which can perceive global semantics and unmixed local information in parallel.
Our framework achieves state-of-the-art performance in both fully supervised and weakly supervised settings on the Audio-Visual Event dataset.
- Score: 4.856609995251114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual event localization aims to localize an event that is both
audible and visible in the wild, which is a widespread audio-visual scene
analysis task for unconstrained videos. To address this task, we propose a
Multimodal Parallel Network (MPN), which can perceive global semantics and
unmixed local information in parallel. Specifically, our MPN framework consists
of a classification subnetwork to predict event categories and a localization
subnetwork to predict event boundaries. The classification subnetwork is
constructed by the Multimodal Co-attention Module (MCM) and obtains global
contexts. The localization subnetwork consists of the Multimodal Bottleneck
Attention Module (MBAM), which is designed to extract fine-grained
segment-level contents. Extensive experiments demonstrate that our framework
achieves state-of-the-art performance in both fully supervised and weakly
supervised settings on the Audio-Visual Event (AVE) dataset.
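The listing ships no reference code, so the following is a minimal PyTorch sketch of the co-attention idea behind the MCM: each modality attends to the other, and the attended features are pooled into a global context for event classification. All names and sizes (CoAttention, d_model, the fusion scheme) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Minimal cross-modal co-attention: audio attends to video and
    vice versa. An illustrative sketch, not the authors' MCM code."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, audio, video):
        # audio, video: (batch, time, d_model) segment-level features
        a_att, _ = self.a2v(query=audio, key=video, value=video)
        v_att, _ = self.v2a(query=video, key=audio, value=audio)
        # fuse attended features into a single global representation
        fused = torch.cat([audio + a_att, video + v_att], dim=-1)
        return fused.mean(dim=1)  # (batch, 2 * d_model) global context

coatt = CoAttention()
a = torch.randn(2, 10, 256)  # e.g. 10 one-second audio segments
v = torch.randn(2, 10, 256)
print(coatt(a, v).shape)     # torch.Size([2, 512])
```

A bottleneck variant in the spirit of the MBAM would presumably route this cross-modal interaction through a small set of latent tokens and skip the global pooling, so that fine-grained segment-level detail is preserved for boundary prediction.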
Related papers
- Global and Local Semantic Completion Learning for Vision-Language Pre-training [34.740507502215536]
Cross-modal alignment plays a crucial role in vision-language pre-training models.
We propose a novel Global and Local Semantic Completion Learning (GLSCL) task to facilitate global-local alignment and local-local alignment simultaneously.
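As a concrete reference point for what "global alignment" means here, a common realization is a symmetric InfoNCE loss over matched image-text pairs; the sketch below is a generic version of that idea, not the GLSCL objective itself.

```python
import torch
import torch.nn.functional as F

def global_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image-text pairs: a common way
    to realize global cross-modal alignment. Schematic only."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(img.size(0))    # diagonal pairs are matches
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

print(global_alignment_loss(torch.randn(4, 64), torch.randn(4, 64)))
```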
arXiv Detail & Related papers (2023-06-12T13:20:29Z)
- Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
- Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural Network [55.21501819988941]
Existing network traffic classification based on convolutional neural networks (CNNs) often emphasizes local patterns of traffic data while ignoring global information associations.
We propose an end-to-end network traffic classification method.
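The MLP-Mixer backbone named in the title alternates token-mixing and channel-mixing MLPs, which is how it recovers the global associations that CNNs miss. Below is a standard Mixer block in PyTorch, with dimensions chosen arbitrarily for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One standard MLP-Mixer block: a token-mixing MLP across the
    sequence, then a channel-mixing MLP per token. A generic sketch of
    the backbone, not the paper's exact configuration."""
    def __init__(self, n_tokens, dim, token_hidden=64, chan_hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                      # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)      # mix across tokens
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))   # mix across channels
        return x

block = MixerBlock(n_tokens=16, dim=128)
print(block(torch.randn(4, 16, 128)).shape)  # torch.Size([4, 16, 128])
```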
arXiv Detail & Related papers (2022-10-30T01:52:05Z)
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework that bridges the visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with an optimal transport distance to generate representative visual and textual summaries.
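The optimal transport distance mentioned here is typically approximated with entropy regularization and Sinkhorn iterations. The following is a minimal sketch of that computation under uniform marginals, with hyperparameters (eps, n_iters) chosen for illustration rather than taken from MHMS.

```python
import torch

def sinkhorn_distance(x, y, eps=0.1, n_iters=50):
    """Entropy-regularized optimal transport between two feature sets
    x: (n, d) and y: (m, d) with uniform marginals. Illustrative only."""
    cost = torch.cdist(x, y)
    cost = cost / cost.max()                    # rescale for numerical stability
    k = torch.exp(-cost / eps)                  # Gibbs kernel
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    u = torch.ones_like(a)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        v = b / (k.t() @ u)
        u = a / (k @ v)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)  # transport plan (n, m)
    return (plan * cost).sum()

video_feats = torch.randn(8, 64)  # e.g. 8 video segment embeddings
text_feats = torch.randn(5, 64)   # e.g. 5 sentence embeddings
print(sinkhorn_distance(video_feats, text_feats))
```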
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn cross-modal correlations and leverage them as semantic guidance.
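Modulation in this sense usually means one modality predicting transformation parameters for the other's features. The FiLM-style sketch below illustrates that mechanism generically; it is an assumption for exposition, not M2N's published design.

```python
import torch
import torch.nn as nn

class CrossModalModulation(nn.Module):
    """FiLM-style modulation: one modality predicts a scale and shift
    applied to the other's features. A generic stand-in for the
    modulation idea, not M2N's architecture."""
    def __init__(self, dim=128):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, target, guide):
        # target, guide: (batch, time, dim) aligned segment features
        scale = torch.sigmoid(self.to_scale(guide))
        return target * scale + self.to_shift(guide)

mod = CrossModalModulation()
a = torch.randn(2, 10, 128)  # audio segment features
v = torch.randn(2, 10, 128)  # visual segment features
print(mod(a, v).shape)       # torch.Size([2, 10, 128])
```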
arXiv Detail & Related papers (2021-08-26T13:11:48Z)
- Multi-level Attention Fusion Network for Audio-visual Event Recognition [6.767885381740951]
Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition.
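A toy version of such dynamic fusion scores every (time step, modality) pair and takes an attention-weighted sum; the sketch below conveys the mechanism schematically and should not be read as the MAFnet architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Toy dynamic fusion: score every (time step, modality) pair and
    take an attention-weighted sum. A schematic sketch of the idea,
    not the published MAFnet."""
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, audio, video):
        # audio, video: (batch, time, dim)
        x = torch.stack([audio, video], dim=2)              # (B, T, 2, dim)
        w = torch.softmax(self.score(x).flatten(1), dim=1)  # over all T*2 slots
        w = w.view_as(x[..., 0]).unsqueeze(-1)              # (B, T, 2, 1)
        return (w * x).sum(dim=(1, 2))                      # (B, dim) clip embedding

fuse = AttentionFusion()
print(fuse(torch.randn(2, 10, 128), torch.randn(2, 10, 128)).shape)
```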
arXiv Detail & Related papers (2021-06-12T10:24:52Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
Experimental results on four benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Dynamic Context-guided Capsule Network for Multimodal Machine Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
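DCCN builds on capsule networks, whose core mechanism is dynamic routing between capsule layers. For reference, here is standard routing-by-agreement (Sabour et al., 2017), the base mechanism that DCCN extends with context guidance, not the DCCN-specific variant.

```python
import torch

def dynamic_routing(u_hat, n_iters=3):
    """Standard capsule routing-by-agreement (Sabour et al., 2017).
    u_hat: (batch, n_in, n_out, dim) prediction vectors. Illustrative."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(n_iters):
        c = torch.softmax(b, dim=2)                   # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)      # weighted sum -> (B, n_out, dim)
        norm = s.norm(dim=-1, keepdim=True)
        v = (norm ** 2 / (1 + norm ** 2)) * s / (norm + 1e-8)  # squash
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)  # agreement update
    return v

u_hat = torch.randn(2, 6, 4, 16)  # 6 input capsules -> 4 output capsules
print(dynamic_routing(u_hat).shape)  # torch.Size([2, 4, 16])
```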
arXiv Detail & Related papers (2020-09-04T06:18:24Z)