Related papers: Multi-Scale Attention for Audio Question Answering

Multi-Scale Attention for Audio Question Answering

URL: http://arxiv.org/abs/2305.17993v1
Date: Mon, 29 May 2023 10:06:58 GMT
Title: Multi-Scale Attention for Audio Question Answering
Authors: Guangyao Li, Yixin Xu, Di Hu
Abstract summary: Audio question answering (AQA) acting as a widely used proxy task to explore scene understanding. Existing methods mostly extend the structures of visual question answering task to audio ones in a simple pattern. We present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module.
Score: 9.254814692650523
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio question answering (AQA), acting as a widely used proxy task to explore scene understanding, has got more attention. The AQA is challenging for it requires comprehensive temporal reasoning from different scales' events of an audio scene. However, existing methods mostly extend the structures of visual question answering task to audio ones in a simple pattern but may not perform well when perceiving a fine-grained audio scene. To this end, we present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module. The former is designed to aggregate unimodal and cross-modal temporal contexts, while the latter captures sound events of varying lengths and their temporal dependencies for a more comprehensive understanding. Extensive experiments are conducted to demonstrate that the proposed MWAFM can effectively explore temporal information to facilitate AQA in the fine-grained scene.Code: https://github.com/GeWu-Lab/MWAFM

Related papers

Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration [48.57159286673662]
This paper aims to advance audio-visual scene understanding for longer, untrimmed videos. We introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration and the Multi-Temporal Granularity Collaboration. Experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
arXiv Detail & Related papers (2024-12-17T07:43:36Z)
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA) SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question. Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z)
Progressive Confident Masking Attention Network for Audio-Visual Segmentation [8.591836399688052]
A challenging problem known as Audio-Visual has emerged, intending to produce segmentation maps for sounding objects within a scene. We introduce a novel Progressive Confident Masking Attention Network (PMCANet) It leverages attention mechanisms to uncover the intrinsic correlations between audio signals and visual frames.
arXiv Detail & Related papers (2024-06-04T14:21:41Z)
Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference. We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios [69.94398424864595]
This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. We introduce the CAT, which enhances Multimodal Large Language Models (MLLMs) in three ways. CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios.
arXiv Detail & Related papers (2024-03-07T16:31:02Z)
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules. Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model. It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Questioning (AVQA) task, which aims to answer questions regarding different visual objects sounds, and their associations in videos. Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types. Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing [7.977954561853929]
We present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively.
arXiv Detail & Related papers (2021-11-24T09:47:26Z)
Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both. We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously. Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.