Learning to Answer Questions in Dynamic Audio-Visual Scenarios
- URL: http://arxiv.org/abs/2203.14072v1
- Date: Sat, 26 Mar 2022 13:03:42 GMT
- Title: Learning to Answer Questions in Dynamic Audio-Visual Scenarios
- Authors: Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen and Di Hu
- Abstract summary: We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-, V-, and AVQA approaches.
- Score: 81.19017026999218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we focus on the Audio-Visual Question Answering (AVQA) task,
which aims to answer questions regarding different visual objects, sounds, and
their associations in videos. The problem requires comprehensive multimodal
understanding and spatio-temporal reasoning over audio-visual scenes. To
benchmark this task and facilitate our study, we introduce a large-scale
MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering
33 different question templates spanning over different modalities and question
types. We develop several baselines and introduce a spatio-temporal grounded
audio-visual network for the AVQA problem. Our results demonstrate that AVQA
benefits from multisensory perception and our model outperforms recent A-, V-,
and AVQA approaches. We believe that our built dataset has the potential to
serve as a testbed for evaluating and promoting progress in audio-visual scene
understanding and spatio-temporal reasoning. Code and dataset:
http://gewu-lab.github.io/MUSIC-AVQA/
Related papers
- SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering [53.00674706030977]
We introduce the Source-aware Semantic Representation Network (SaSR-Net), a novel model designed for Audio-Visual Question Answering (AVQA).
SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with the corresponding question.
Experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms state-of-the-art AVQA methods.
arXiv Detail & Related papers (2024-11-07T18:12:49Z)
- Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
arXiv Detail & Related papers (2024-07-30T09:41:37Z)
- Answering Diverse Questions via Text Attached with Key Audio-Visual Clues [24.347420432207283]
We propose a framework for performing mutual correlation distillation (MCD) to aid question inference.
We evaluate the proposed method on two publicly available datasets containing multiple question-and-answer pairs.
arXiv Detail & Related papers (2024-03-11T12:51:37Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z)
- NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario [77.14723238359318]
NuScenesQA is the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs.
We leverage existing 3D detection annotations to generate scene graphs and design question templates manually.
We develop a series of baselines that employ advanced 3D detection and VQA techniques.
arXiv Detail & Related papers (2023-05-24T07:40:50Z)
- A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge [39.788346536244504]
A-OKVQA is a crowdsourced dataset composed of about 25K questions.
We demonstrate the potential of this new dataset through a detailed analysis of its contents.
arXiv Detail & Related papers (2022-06-03T17:52:27Z)
- NAAQA: A Neural Architecture for Acoustic Question Answering [8.364707318181193]
The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene.
We propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs.
We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs.
arXiv Detail & Related papers (2021-06-11T03:05:48Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)