Multi-Scale Attention for Audio Question Answering
- URL: http://arxiv.org/abs/2305.17993v1
- Date: Mon, 29 May 2023 10:06:58 GMT
- Title: Multi-Scale Attention for Audio Question Answering
- Authors: Guangyao Li, Yixin Xu, Di Hu
- Abstract summary: Audio question answering (AQA) serves as a widely used proxy task for exploring scene understanding.
Existing methods mostly extend the structures of the visual question answering task to audio in a simple pattern.
We present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module.
- Score: 9.254814692650523
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio question answering (AQA), acting as a widely used proxy task to explore
scene understanding, has received increasing attention. AQA is challenging because it
requires comprehensive temporal reasoning over events at different scales within an
audio scene. However, existing methods mostly extend the structures of the visual
question answering task to audio in a simple pattern and may not perform
well when perceiving a fine-grained audio scene. To this end, we present a
Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous
hybrid attention module and a multi-scale window attention module. The former
is designed to aggregate unimodal and cross-modal temporal contexts, while the
latter captures sound events of varying lengths and their temporal dependencies
for a more comprehensive understanding. Extensive experiments are conducted to
demonstrate that the proposed MWAFM can effectively explore temporal
information to facilitate AQA in the fine-grained scene. Code:
https://github.com/GeWu-Lab/MWAFM
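As a rough illustration of the multi-scale window attention idea described in the abstract (self-attention applied within temporal windows of several sizes, with the per-scale outputs fused), here is a minimal PyTorch sketch. It is not the official GeWu-Lab/MWAFM implementation; the window sizes, feature dimension, and concatenation-plus-linear fusion are illustrative assumptions.

```python
# Minimal sketch of multi-scale window attention over an audio feature sequence.
# NOT the official MWAFM code: window sizes, dimensions, and the fusion layer
# are assumptions chosen for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleWindowAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8, window_sizes=(2, 4, 8)):
        super().__init__()
        self.window_sizes = window_sizes
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in window_sizes
        )
        self.fuse = nn.Linear(dim * len(window_sizes), dim)

    def forward(self, x):  # x: (batch, time, dim) audio segment features
        b, t, d = x.shape
        per_scale = []
        for w, attn in zip(self.window_sizes, self.attns):
            pad = (w - t % w) % w                           # pad time so it divides w
            xp = F.pad(x, (0, 0, 0, pad))
            xw = xp.reshape(b, -1, w, d).reshape(-1, w, d)  # (b * num_windows, w, d)
            out, _ = attn(xw, xw, xw)                       # local self-attention per window
            per_scale.append(out.reshape(b, -1, d)[:, :t])  # drop the padding
        # fuse the scale-specific contexts back to the model dimension
        return self.fuse(torch.cat(per_scale, dim=-1))


if __name__ == "__main__":
    feats = torch.randn(2, 60, 512)  # e.g. 60 one-second audio segments
    print(MultiScaleWindowAttention()(feats).shape)  # torch.Size([2, 60, 512])
```

The intuition matches the abstract: short windows attend over brief sound events, while longer windows cover events that unfold across many segments, and the fusion step combines these temporal views.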
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z) - Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z) - Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual Segmentation has emerged, aiming to produce segmentation maps for sounding objects within a scene.
We introduce a novel Progressive Confident Masking Attention Network (PMCANet).
It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z) - Answering Diverse Questions via Text Attached with Key Audio-Visual
Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z) - CAT: Enhancing Multimodal Large Language Model to Answer Questions in
Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components.
We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways.
CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual
Event Localization and Video Parsing [7.977954561853929]
We present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing.
We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively.
arXiv Detail & Related papers (2021-11-24T09:47:26Z) - Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video
Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
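Both the hybrid attention network in the entry above and MWAFM's asynchronous hybrid attention module combine unimodal and cross-modal temporal contexts. The following is a hedged sketch of that general pattern, with layer sizes and the residual/normalization wiring chosen as assumptions rather than taken from either paper.

```python
# Rough sketch of a hybrid attention block: unimodal self-attention over audio
# segments followed by cross-modal attention to question tokens. Not the
# published architectures; sizes and residual/normalization choices are assumed.
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio, question):
        # unimodal temporal context: audio segments attend to each other
        a, _ = self.self_attn(audio, audio, audio)
        audio = self.norm1(audio + a)
        # cross-modal context: audio segments attend to question tokens
        c, _ = self.cross_attn(audio, question, question)
        return self.norm2(audio + c)


if __name__ == "__main__":
    audio = torch.randn(2, 60, 512)     # 60 audio segment features
    question = torch.randn(2, 12, 512)  # 12 question token embeddings
    print(HybridAttentionBlock()(audio, question).shape)  # torch.Size([2, 60, 512])
```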
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.