Multi-level Attention Fusion Network for Audio-visual Event Recognition
- URL: http://arxiv.org/abs/2106.06736v1
- Date: Sat, 12 Jun 2021 10:24:52 GMT
- Title: Multi-level Attention Fusion Network for Audio-visual Event Recognition
- Authors: Mathilde Brousmiche and Jean Rouat and Stéphane Dupont
- Abstract summary: Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition.
- Score: 6.767885381740951
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Event classification is inherently sequential and multimodal. Therefore, deep
neural models need to dynamically focus on the most relevant time window and/or
modality of a video. In this study, we propose the Multi-level Attention Fusion
network (MAFnet), an architecture that can dynamically fuse visual and audio
information for event recognition. Inspired by prior studies in neuroscience,
we couple both modalities at different levels of visual and audio paths.
Furthermore, the network dynamically highlights a modality at a given time
window relevant to classify events. Experimental results in AVE (Audio-Visual
Event), UCF51, and Kinetics-Sounds datasets show that the approach can
effectively improve the accuracy in audio-visual event classification. Code is
available at: https://github.com/numediart/MAFnet
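The abstract describes the core idea of MAFnet: at each time window, attention weights decide how much the audio and visual streams contribute before fusion. A minimal NumPy sketch of such per-time-step modality weighting (not the authors' code; `w_score` is a hypothetical learned scoring vector) could look like:

```python
import numpy as np

def modality_time_attention(audio, visual, w_score):
    """Hedged sketch of dynamic audio-visual fusion: a softmax over
    the two modalities at each time step decides their mixing weights.
    audio, visual: (T, D) per-time-step features; w_score: (D,) assumed
    learned scoring vector."""
    x = np.stack([audio, visual], axis=1)        # (T, 2, D)
    scores = x @ w_score                         # (T, 2): one score per modality per step
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)            # softmax over the two modalities
    fused = (w[..., None] * x).sum(axis=1)       # (T, D) weighted fusion
    return fused.mean(axis=0), w                 # clip-level feature + attention weights
```

A video dominated by informative audio (e.g. a siren off-screen) would push the weights toward the audio stream in the relevant windows, which is the dynamic highlighting the abstract refers to.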
Related papers
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
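The summary above describes projecting segment features onto label embeddings. A hedged sketch of such a projection step (names and the softmax weighting are assumptions, not the paper's implementation) might be:

```python
import numpy as np

def project_to_labels(seg_feats, label_embs):
    """Hypothetical sketch: re-express encoded audio/visual segment
    features as soft combinations of semantically independent label
    embeddings, via cosine-similarity softmax weights.
    seg_feats: (S, D) segment features; label_embs: (C, D) label embeddings."""
    s = seg_feats / np.linalg.norm(seg_feats, axis=1, keepdims=True)
    l = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sim = s @ l.T                                    # (S, C) cosine similarities
    sim -= sim.max(axis=1, keepdims=True)            # numerical stability
    w = np.exp(sim)
    w /= w.sum(axis=1, keepdims=True)                # softmax over labels
    return w @ label_embs, w                         # segments in label space + weights
```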
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet)
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification [60.038979555455775]
We propose a Temporal Multi-modal graph learning method for Acoustic event Classification, called TMac.
In particular, we construct a temporal graph for each acoustic event, dividing its audio data and video data into multiple segments.
Several experiments are conducted to demonstrate that TMac outperforms other state-of-the-art models.
arXiv Detail & Related papers (2023-09-21T07:39:08Z)
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Furnishing Sound Event Detection with Language Model Abilities [11.435984426303419]
We propose an elegant method that aligns audio features and text features to accomplish sound event classification and temporal localization.
The framework consists of an acoustic encoder, a contrastive module that aligns the corresponding representations of the text and audio, and a decoupled language decoder.
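Contrastive alignment of paired audio and text representations, as described above, is typically implemented as a symmetric InfoNCE-style loss. A minimal sketch under that assumption (not the paper's exact module) is:

```python
import numpy as np

def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Hedged sketch of a symmetric InfoNCE loss: row i of audio_emb
    and row i of text_emb are a matching pair; all other rows in the
    batch serve as negatives."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature                   # (N, N) similarity matrix
    idx = np.arange(len(a))

    def xent(l):
        # cross-entropy with the matching pair on the diagonal as target
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[idx, idx]).mean()

    return (xent(logits) + xent(logits.T)) / 2       # audio->text and text->audio
```

Well-aligned pairs concentrate probability mass on the diagonal of the similarity matrix, driving the loss down; mismatched pairs raise it.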
arXiv Detail & Related papers (2023-08-22T15:59:06Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that, although our network is trained with single labels, it can output additional true multi-labels to describe the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)
- Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face videos.
The proposed method outperforms 12 state-of-the-art saliency prediction methods, and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.