CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
- URL: http://arxiv.org/abs/2310.16754v2
- Date: Fri, 27 Oct 2023 11:36:47 GMT
- Title: CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
- Authors: Asmar Nadeem, Adrian Hilton, Robert Dawes, Graham Thomas, Armin Mustafa
- Abstract summary: Existing AVQA methods suffer from two major shortcomings.
The proposed novel CAD network improves the overall performance over the state-of-the-art methods on average by 9.4% on the MUSIC-AVQA dataset.
- Score: 20.155816093525374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the context of Audio Visual Question Answering (AVQA) tasks, the
audio-visual modalities can be learnt on three levels: 1) Spatial, 2) Temporal, and
3) Semantic. Existing AVQA methods suffer from two major shortcomings: the
audio-visual (AV) information passing through the network is not aligned on the
Spatial and Temporal levels, and inter-modal (audio and visual) Semantic
information is often not balanced within a context. Both result in poor
performance. In this paper, we propose a novel end-to-end Contextual
Multi-modal Alignment (CAD) network that addresses the challenges in AVQA
methods by i) introducing a parameter-free stochastic Contextual block that
ensures robust audio and visual alignment on the Spatial level; ii) proposing a
pre-training technique for dynamic audio and visual alignment on the Temporal
level in a self-supervised setting; and iii) introducing a cross-attention mechanism
to balance audio and visual information on the Semantic level. The proposed novel
CAD network improves the overall performance over the state-of-the-art methods
on average by 9.4% on the MUSIC-AVQA dataset. We also demonstrate that our
proposed contributions to AVQA can be added to the existing methods to improve
their performance without additional complexity requirements.
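As a rough illustration of the cross-attention idea in item iii), the sketch below lets audio features attend over visual features and vice versa, then fuses both streams. It is not the CAD implementation: the feature shapes, the function names `softmax` and `cross_attention`, the residual fusion, and the random inputs are all assumptions made for the example, and learned projection matrices are omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.

    queries: (Tq, d) features of one modality
    context: (Tc, d) features of the other modality
    Returns the attended context (Tq, d) and the attention weights (Tq, Tc).
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


# Hypothetical inputs: 10 temporal segments with 64-dimensional features each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 64))
visual = rng.standard_normal((10, 64))

# Attend in both directions so neither modality dominates the fused result.
audio_ctx, w_av = cross_attention(audio, visual)
visual_ctx, w_va = cross_attention(visual, audio)

# Residual connection per stream, then concatenate along the feature axis.
fused = np.concatenate([audio + audio_ctx, visual + visual_ctx], axis=-1)
print(fused.shape)
```

Running attention in both directions is one simple way to keep the two modalities symmetric; a trained model would normally add learned query, key, and value projections and multiple heads.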
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
- Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation [7.124066540020968]
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS) pursues semantic understanding of audio-visual scenes.
Previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimization.
We propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks from localization to semantic understanding, which are fully optimized in each stage to achieve step-by-step global optimization.
arXiv Detail & Related papers (2024-07-16T15:08:30Z)
- Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation has emerged, intending to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
- CLIP-Powered TASS: Target-Aware Single-Stream Network for Audio-Visual Question Answering [6.719652962434731]
This paper proposes a new CLIP-powered target-aware single-stream (TASS) network for audio-visual question answering (AVQA).
It consists of two key components: the target-aware spatial grounding module (TSG+) and the single-stream joint temporal grounding module (JTG).
arXiv Detail & Related papers (2024-05-13T03:25:15Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes representation of each modality by fusing them at different levels of audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Leveraging Foundation models for Unsupervised Audio-Visual Segmentation [49.94366155560371]
Audio-Visual Segmentation (AVS) aims to precisely outline audible objects in a visual scene at the pixel level.
Existing AVS methods require fine-grained annotations of audio-mask pairs in a supervised learning fashion.
We introduce unsupervised audio-visual segmentation with no need for task-specific data annotations and model training.
arXiv Detail & Related papers (2023-09-13T05:05:47Z)
- Improving Audio-Visual Segmentation with Bidirectional Generation [40.78395709407226]
We introduce a bidirectional generation framework for audio-visual segmentation.
This framework establishes robust correlations between an object's visual characteristics and its associated sound.
We also introduce an implicit volumetric motion estimation module to handle temporal dynamics.
arXiv Detail & Related papers (2023-08-16T11:20:23Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)