NAAQA: A Neural Architecture for Acoustic Question Answering
- URL: http://arxiv.org/abs/2106.06147v3
- Date: Fri, 12 Jan 2024 14:58:10 GMT
- Title: NAAQA: A Neural Architecture for Acoustic Question Answering
- Authors: Jerome Abdelnour, Jean Rouat, Giampiero Salvi
- Abstract summary: The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene.
We propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs.
We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs.
- Score: 8.364707318181193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of the Acoustic Question Answering (AQA) task is to answer a
free-form text question about the content of an acoustic scene. It was inspired
by the Visual Question Answering (VQA) task. In this paper, based on the
previously introduced CLEAR dataset, we propose a new benchmark for AQA, namely
CLEAR2, that emphasizes the specific challenges of acoustic inputs. These
include the handling of variable-duration scenes and of scenes built from elementary
sounds that differ between the training and test sets. We also introduce NAAQA, a
neural architecture that leverages specific properties of acoustic inputs. The
use of 1D convolutions in time and frequency to process 2D spectro-temporal
representations of acoustic content shows promising results and enables
reductions in model complexity. We show that time coordinate maps improve temporal
localization, increasing the accuracy of the network by ~17 percentage points,
whereas frequency coordinate maps have little influence on this task. NAAQA achieves
79.5% accuracy on the AQA task with ~4 times fewer parameters than the previously
explored VQA model. We evaluate the performance of NAAQA on an independent dataset
reconstructed from DAQA. We also test the addition of a MALiMo module to our model
on both CLEAR2 and DAQA.
We provide a detailed analysis of the results for the different question types.
We release the code to produce CLEAR2 as well as NAAQA to foster research in
this newly emerging machine learning task.
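
The abstract describes two architectural ideas: decomposing a 2D convolution into separate 1D convolutions along time and frequency, and appending a time coordinate map to the spectro-temporal input. The snippet below is a minimal PyTorch sketch of these ideas, assuming a CoordConv-style coordinate channel; the module names, kernel sizes, and feature dimensions are illustrative placeholders rather than the published NAAQA configuration.

```python
# Illustrative sketch only: separable 1D convolutions over time and frequency
# applied to a 2D spectro-temporal input, plus a CoordConv-style time
# coordinate map. Layer names and sizes are assumptions, not NAAQA's exact setup.
import torch
import torch.nn as nn


def add_time_coordinate_map(x: torch.Tensor) -> torch.Tensor:
    """Append a channel encoding normalized time position in [-1, 1].

    x: (batch, channels, freq, time) spectro-temporal features.
    """
    b, _, f, t = x.shape
    coords = torch.linspace(-1.0, 1.0, t, device=x.device)
    coords = coords.view(1, 1, 1, t).expand(b, 1, f, t)
    return torch.cat([x, coords], dim=1)


class SeparableTimeFreqBlock(nn.Module):
    """1D convolution along time followed by 1D convolution along frequency.

    Using (1, k) and (k, 1) kernels instead of a full (k, k) kernel reduces
    the parameter count while still covering a 2D receptive field.
    """

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.time_conv = nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, k // 2))
        self.freq_conv = nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1), padding=(k // 2, 0))
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.time_conv(x))
        x = self.act(self.bn(self.freq_conv(x)))
        return x


if __name__ == "__main__":
    spec = torch.randn(2, 1, 64, 400)       # (batch, 1, mel bins, frames)
    spec = add_time_coordinate_map(spec)     # -> (2, 2, 64, 400)
    block = SeparableTimeFreqBlock(in_ch=2, out_ch=16)
    print(block(spec).shape)                 # torch.Size([2, 16, 64, 400])
```

Replacing a full (k, k) kernel with (1, k) and (k, 1) kernels is one way to obtain the reduction in model complexity that the abstract attributes to 1D convolutions in time and frequency.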
Related papers
- Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering [25.577314828249897]
We propose a novel dataset, MUSIC-AVQA-R, crafted in two steps: rephrasing questions within the test split of a public dataset (MUSIC-AVQA) and introducing distribution shifts to split questions.
Experimental results show that this architecture achieves state-of-the-art performance on MUSIC-AVQA-R, notably obtaining a significant improvement of 9.32%.
arXiv Detail & Related papers (2024-04-18T09:16:02Z) - AQUALLM: Audio Question Answering Data Generation Using Large Language
Models [2.2232550112727267]
We introduce a scalable AQA data generation pipeline, which relies on Large Language Models (LLMs).
We present three extensive and high-quality benchmark datasets for AQA.
Models trained on our datasets demonstrate enhanced generalizability when compared to models trained using human-annotated AQA data.
arXiv Detail & Related papers (2023-12-28T20:01:27Z) - UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z) - Attention-Based Methods For Audio Question Answering [16.82832919748399]
We propose neural network architectures based on self-attention and cross-attention for the audio question answering task.
All our models are trained on the recently proposed Clotho-AQA dataset for both binary yes/no questions and single-word answer questions.
arXiv Detail & Related papers (2023-05-31T12:00:51Z) - Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out two drawbacks in current RVQA research: (1) datasets contain too many unchallenging UQs, and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
This combines pseudo UQs obtained by randomly pairing images and questions, with an
arXiv Detail & Related papers (2023-03-09T06:58:29Z) - Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-, V-, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z) - DUAL: Textless Spoken Question Answering with Speech Discrete Unit
Adaptive Learning [66.71308154398176]
Spoken Question Answering (SQA) has gained research attention and made remarkable progress in recent years.
Existing SQA methods rely on Automatic Speech Recognition (ASR) transcripts, which are time- and cost-prohibitive to collect.
This work proposes an ASR transcript-free SQA framework named Discrete Unit Adaptive Learning (DUAL), which leverages unlabeled data for pre-training and is fine-tuned by the SQA downstream task.
arXiv Detail & Related papers (2022-03-09T17:46:22Z) - ASQ: Automatically Generating Question-Answer Pairs using AMRs [1.0878040851638]
We introduce ASQ, a tool to automatically mine questions and answers from a sentence, using its Abstract Meaning Representation (AMR).
A qualitative evaluation of the output generated by ASQ from the AMR 2.0 data shows that the question-answer pairs are natural and valid.
We intend to make this tool and the results publicly available for others to use and build upon.
arXiv Detail & Related papers (2021-05-20T20:38:05Z) - Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised
Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z) - Generating Diverse and Consistent QA pairs from Contexts with
Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)