Multi-Modulation Network for Audio-Visual Event Localization
- URL: http://arxiv.org/abs/2108.11773v2
- Date: Mon, 30 Aug 2021 13:11:02 GMT
- Title: Multi-Modulation Network for Audio-Visual Event Localization
- Authors: Hao Wang, Zheng-Jun Zha, Liang Li, Xuejin Chen, Jiebo Luo
- Abstract summary: We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel Multi-Modulation Network (M2N) to learn the correlation between segments of the two modalities and between multi-scale event proposals, and to leverage it as semantic guidance.
- Score: 138.14529518908736
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of localizing audio-visual events that are both audible
and visible in a video. Existing works focus on encoding and aligning audio and
visual features at the segment level while neglecting informative correlation
between segments of the two modalities and between multi-scale event proposals.
We propose a novel Multi-Modulation Network (M2N) to learn the above correlation
and leverage it as semantic guidance to modulate the related auditory, visual,
and fused features. In particular, during feature encoding, we propose
cross-modal normalization and intra-modal normalization. The former modulates
the features of two modalities by establishing and exploiting the cross-modal
relationship. The latter modulates the features of a single modality with the
event-relevant semantic guidance of the same modality. In the fusion stage, we
propose a multi-scale proposal modulating module and a multi-alignment segment
modulating module to introduce multi-scale event proposals and enable dense
matching between cross-modal segments. With the auditory, visual, and fused
features modulated by the correlation information regarding audio-visual
events, M2N performs accurate event localization. Extensive experiments
conducted on the AVE dataset demonstrate that our proposed method outperforms
the state of the art in both supervised event localization and cross-modality
localization.
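The abstract does not spell out how the cross-modal and intra-modal normalization are computed. As a rough illustration only, the sketch below treats "modulation" as conditional normalization: one segment's feature vector is normalized, then scaled and shifted by parameters predicted from a guidance vector. The function names, the linear guidance maps, and the `(1 + gamma)` scaling are assumptions made for this example, not the paper's actual formulation.

```python
import math

def normalize(x, eps=1e-5):
    """Zero-mean, unit-variance normalization of one feature vector."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def linear(w, b, x):
    """Affine map w @ x + b, with w given as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def modulate(target, guide, w_gamma, b_gamma, w_beta, b_beta):
    """Normalize `target`, then scale and shift it with parameters
    predicted from `guide` (hypothetical design, not from the paper)."""
    gamma = linear(w_gamma, b_gamma, guide)  # per-channel scale offset
    beta = linear(w_beta, b_beta, guide)     # per-channel shift
    normed = normalize(target)
    return [(1.0 + g) * n + b for g, n, b in zip(gamma, normed, beta)]

# Toy usage: a 4-d visual segment feature guided by a 2-d audio feature.
visual = [1.0, 2.0, 3.0, 4.0]
audio = [0.5, -0.5]
w = [[0.1, -0.2] for _ in range(4)]  # made-up weights for illustration
b = [0.0] * 4
modulated_visual = modulate(visual, audio, w, b, w, b)
```

Under this reading, the cross-modal case passes a guidance vector from the other modality (audio guiding visual, or the reverse), while the intra-modal case would pass a guidance vector derived from the same modality, such as a pooled summary over the video's segments.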
Related papers
- Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration [48.57159286673662]
This paper aims to advance audio-visual scene understanding for longer, untrimmed videos.
We introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration and the Multi-Temporal Granularity Collaboration.
Experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
arXiv Detail & Related papers (2024-12-17T07:43:36Z) - Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
We present LoCo, a Locality-aware cross-modal Correspondence learning framework for Dense Audio-Visual Events (DAVE) localization.
LoCo applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties.
We further customize a Cross-modal Dynamic Perception (CDP) layer in a cross-modal feature pyramid to understand local temporal patterns of audio-visual events.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Cross-Modal Reasoning with Event Correlation for Video Question Answering [32.332251488360185]
We introduce the dense caption modality as a new auxiliary and distill event-correlated information from it to infer the correct answer.
We employ cross-modal reasoning modules for explicitly modeling inter-modal relationships and aggregating relevant information across different modalities.
We propose a question-guided self-adaptive multi-modal fusion module to collect the question-oriented and event-correlated evidence through multi-step reasoning.
arXiv Detail & Related papers (2023-12-20T02:30:39Z) - Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z) - Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z) - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)