Multi-level Attention Fusion Network for Audio-visual Event Recognition
- URL: http://arxiv.org/abs/2106.06736v1
- Date: Sat, 12 Jun 2021 10:24:52 GMT
- Title: Multi-level Attention Fusion Network for Audio-visual Event Recognition
- Authors: Mathilde Brousmiche and Jean Rouat and Stéphane Dupont
- Abstract summary: Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition.
- Score: 6.767885381740951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event classification is inherently sequential and multimodal. Therefore, deep
neural models need to dynamically focus on the most relevant time window and/or
modality of a video. In this study, we propose the Multi-level Attention Fusion
network (MAFnet), an architecture that can dynamically fuse visual and audio
information for event recognition. Inspired by prior studies in neuroscience,
we couple both modalities at different levels of visual and audio paths.
Furthermore, the network dynamically highlights the modality that is relevant for
classifying events within a given time window. Experimental results on the AVE
(Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach
can effectively improve the accuracy of audio-visual event classification. Code is
available at: https://github.com/numediart/MAFnet
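To make the fusion mechanism concrete, the following is a minimal, hypothetical sketch of modality and temporal attention over per-segment audio and visual features. It is not the authors' implementation (see the repository above): the feature dimensions, layer sizes, and the single fusion level shown here are illustrative assumptions, whereas MAFnet couples the modalities at several levels of the audio and visual paths.

```python
# Hypothetical sketch of attention-based audio-visual fusion; NOT the official MAFnet
# code (see https://github.com/numediart/MAFnet for the authors' implementation).
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuses per-segment audio and visual features with modality and temporal attention."""
    def __init__(self, audio_dim=128, visual_dim=512, hidden_dim=256, num_classes=28):
        super().__init__()  # num_classes=28 as in the AVE dataset; other sizes are arbitrary
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)    # project both modalities
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)  # into a shared space
        self.modality_att = nn.Linear(hidden_dim, 1)  # scores each modality per time window
        self.temporal_att = nn.Linear(hidden_dim, 1)  # scores each time window
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim), T = number of time windows
        a = torch.tanh(self.audio_proj(audio))        # (B, T, H)
        v = torch.tanh(self.visual_proj(visual))      # (B, T, H)
        stacked = torch.stack([a, v], dim=2)          # (B, T, 2, H)
        # Modality attention: which modality matters within each time window.
        m_weights = torch.softmax(self.modality_att(stacked), dim=2)  # (B, T, 2, 1)
        fused = (m_weights * stacked).sum(dim=2)      # (B, T, H)
        # Temporal attention: which time windows matter for the event.
        t_weights = torch.softmax(self.temporal_att(fused), dim=1)    # (B, T, 1)
        clip = (t_weights * fused).sum(dim=1)         # (B, H)
        return self.classifier(clip)                  # (B, num_classes)

# Usage on random features: 4 videos, 10 one-second segments per video.
logits = AttentionFusion()(torch.randn(4, 10, 128), torch.randn(4, 10, 512))
```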
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification [60.038979555455775]
We propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac.
In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments.
Several experiments demonstrate that TMac outperforms other state-of-the-art models.
arXiv Detail & Related papers (2023-09-21T07:39:08Z) - Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF, our text-to-feature diffusion approach, obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z) - Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that, although our network is formulated with single labels, it can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos (a minimal sketch of such a co-attention step is given after this list).
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
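As an illustration of the co-attention idea in the last entry, below is a minimal, hypothetical sketch of a single cross-modal co-attention step in PyTorch. The projections, dimensions, and residual connections are assumptions and do not reproduce the paper's architecture; a full self-supervised setup would additionally need a pretext objective on unlabelled videos.

```python
# Hypothetical sketch of one cross-modal co-attention step; not the code of the
# "Look, Listen, and Attend" paper. Dimensions and projections are assumptions.
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    """Lets each modality attend to the other and returns both refined sequences."""
    def __init__(self, dim=256):
        super().__init__()
        # Separate query/key/value projections for each attention direction.
        self.q_audio, self.k_visual, self.v_visual = (nn.Linear(dim, dim) for _ in range(3))
        self.q_visual, self.k_audio, self.v_audio = (nn.Linear(dim, dim) for _ in range(3))
        self.scale = dim ** 0.5

    def forward(self, audio, visual):
        # audio: (B, Ta, dim), visual: (B, Tv, dim)
        att_av = torch.softmax(
            self.q_audio(audio) @ self.k_visual(visual).transpose(1, 2) / self.scale, dim=-1)
        att_va = torch.softmax(
            self.q_visual(visual) @ self.k_audio(audio).transpose(1, 2) / self.scale, dim=-1)
        audio_refined = audio + att_av @ self.v_visual(visual)   # audio attends to visual
        visual_refined = visual + att_va @ self.v_audio(audio)   # visual attends to audio
        return audio_refined, visual_refined

# Usage on random segment-level features: batch of 2, 10 audio and 8 visual segments.
audio_out, visual_out = CoAttention()(torch.randn(2, 10, 256), torch.randn(2, 8, 256))
```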
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.