Attention-Based Methods For Audio Question Answering
- URL: http://arxiv.org/abs/2305.19769v1
- Date: Wed, 31 May 2023 12:00:51 GMT
- Title: Attention-Based Methods For Audio Question Answering
- Authors: Parthasaarathy Sudarsanam, Tuomas Virtanen
- Abstract summary: We propose neural network architectures based on self-attention and cross-attention for the audio question answering task.
All our models are trained on the recently proposed Clotho-AQA dataset for both binary yes/no questions and single-word answer questions.
- Score: 16.82832919748399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio question answering (AQA) is the task of producing natural language
answers when a system is provided with audio and natural language questions. In
this paper, we propose neural network architectures based on self-attention and
cross-attention for the AQA task. The self-attention layers extract powerful
audio and textual representations. The cross-attention layers map the audio
features that are relevant to the textual features to produce answers. All our models are
trained on the recently proposed Clotho-AQA dataset for both binary yes/no
questions and single-word answer questions. Our results clearly show
improvement over the reference method reported in the original paper. On the
yes/no binary classification task, our proposed model achieves an accuracy of
68.3%, compared to 62.7% for the reference model. For the single-word answer
multiclass classifier, our model achieves top-1 and top-5 accuracies of 57.9%
and 99.8%, compared to 54.2% and 93.7% respectively for the reference model. We
further discuss some of the challenges in the Clotho-AQA dataset such as the
presence of the same answer word in multiple tenses, singular and plural forms,
and the presence of specific and generic answers to the same question. We
address these issues and present a revised version of the dataset.
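As a rough illustration of the kind of architecture the abstract describes, the sketch below wires self-attention over each modality into a cross-attention step in which question tokens attend over audio frames. All dimensions, layer choices, and the answer-vocabulary size are assumptions for illustration, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class AQASketch(nn.Module):
    """Illustrative self-attention + cross-attention AQA classifier.

    Every dimension here is an assumption, not the paper's setup.
    """

    def __init__(self, audio_dim=128, text_dim=300, d_model=256,
                 n_heads=4, n_answers=2):
        super().__init__()
        # Project pre-extracted audio and question features to a shared width.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.text_proj = nn.Linear(text_dim, d_model)
        # Self-attention refines each modality independently.
        self.audio_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_self = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Cross-attention: question tokens (queries) attend over audio frames.
        self.cross = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_answers)  # 2 logits for yes/no

    def forward(self, audio_feats, text_feats):
        a = self.audio_self(self.audio_proj(audio_feats))  # (B, T_audio, d)
        t = self.text_self(self.text_proj(text_feats))     # (B, T_text, d)
        # Each question token gathers the audio evidence relevant to it.
        fused, _ = self.cross(query=t, key=a, value=a)     # (B, T_text, d)
        return self.classifier(fused.mean(dim=1))          # (B, n_answers)

model = AQASketch()
logits = model(torch.randn(2, 100, 128), torch.randn(2, 12, 300))
print(logits.shape)  # torch.Size([2, 2])
```

For the single-word task, the same body would simply use a larger `n_answers` equal to the answer vocabulary.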
Related papers
- SubjECTive-QA: Measuring Subjectivity in Earnings Call Transcripts' QA Through Six-Dimensional Feature Analysis [4.368712652579087]
SubjECTive-QA is a human-annotated dataset of question-answer pairs from Earnings Call Transcripts (ECTs).
The dataset includes 49,446 annotations for long-form QA pairs across six features: Assertive, Cautious, Optimistic, Specific, Clear, and Relevant.
We find that the best-performing Pre-trained Language Model (PLM), RoBERTa-base, achieves weighted F1 scores similar to Llama-3-70b-Chat on the features with lower subjectivity.
arXiv Detail & Related papers (2024-10-28T01:17:34Z)
- GSQA: An End-to-End Model for Generative Spoken Question Answering [54.418723701886115]
We introduce the first end-to-end Generative Spoken Question Answering (GSQA) model that empowers the system to engage in abstractive reasoning.
Our model surpasses the previous extractive model by 3% on extractive QA datasets.
Our GSQA model shows the potential to generalize to a broad spectrum of questions, thus further expanding the spoken question answering capabilities of abstractive QA.
arXiv Detail & Related papers (2023-12-15T13:33:18Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
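To make the perturbation idea concrete, here is a hedged sketch of a question-side perturbation: swapping a content word so the image no longer supports an answer. The distractor list and labeling scheme are invented purely for illustration and are not the UNK-VQA pipeline.

```python
import random

# Hypothetical distractor substitutions, invented for this sketch only.
DISTRACTORS = {"cat": "submarine", "red": "transparent", "table": "glacier"}

def perturb_question(question: str, rng: random.Random) -> tuple[str, bool]:
    """Swap one content word; return the question and an 'answerable' flag."""
    words = question.split()
    idxs = [i for i, w in enumerate(words) if w.lower() in DISTRACTORS]
    if not idxs:
        return question, True       # nothing to perturb; still answerable
    i = rng.choice(idxs)
    words[i] = DISTRACTORS[words[i].lower()]
    return " ".join(words), False   # perturbed; label as unanswerable

rng = random.Random(0)
print(perturb_question("What color is the red cat?", rng))
```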
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from less data, achieving the same performance with just 40% of the data.
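The benefit of explanations is essentially a prompting effect, so a minimal sketch of assembling a chain-of-thought style multiple-choice prompt may help. The field names and layout below are assumptions for illustration, not the benchmark's actual schema.

```python
# Build a multiple-choice prompt where demonstrations carry explanations.
def build_cot_prompt(question, choices, lecture, explanation=None, answer=None):
    lines = [f"Lecture: {lecture}", f"Question: {question}"]
    lines += [f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)]
    if answer is not None:
        # Few-shot demonstration: show the answer followed by its explanation.
        lines.append(f"Answer: ({answer}) BECAUSE: {explanation}")
    else:
        # Test item: the model completes the answer and explanation itself.
        lines.append("Answer:")
    return "\n".join(lines)

demo = build_cot_prompt(
    question="Which property do these two objects have in common?",
    choices=["hard", "stretchy"],
    lecture="A property is something you can observe about an object.",
    explanation="A rubber band and a sock both stretch, so both are stretchy.",
    answer="B",
)
print(demo)
```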
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering [18.581514902689346]
We introduce Clotho-AQA, a dataset for audio question answering consisting of 1,991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset.
For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk.
We present two baseline experiments to describe the usage of our dataset for the AQA task.
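A minimal sketch of how such annotations might be grouped per clip for the AQA task follows; the CSV column names are assumptions for illustration, not the dataset's published schema.

```python
import csv
from collections import defaultdict

# Group question-answer pairs by audio clip (hypothetical column names).
def load_aqa(csv_path):
    per_clip = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            per_clip[row["file_name"]].append((row["question"], row["answer"]))
    return per_clip

# Usage (with a hypothetical annotation file):
# for clip, qa_pairs in load_aqa("clotho_aqa_train.csv").items():
#     assert len(qa_pairs) == 6   # six questions were collected per clip
```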
arXiv Detail & Related papers (2022-04-20T17:28:53Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- ListReader: Extracting List-form Answers for Opinion Questions [18.50111430378249]
ListReader is a neural extractive QA model for list-form answers.
In addition to learning the alignment between the question and content, we introduce a heterogeneous graph neural network.
Our model adopts a co-extraction setting that can extract either span- or sentence-level answers.
arXiv Detail & Related papers (2021-10-22T10:33:08Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native ASS, called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
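A hedged sketch of the general idea: pool a candidate's other responses into a speaker context vector that conditions the scorer. All dimensions and modules are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeakerConditionedScorer(nn.Module):
    """Score one response conditioned on the speaker's other responses."""

    def __init__(self, d=256):
        super().__init__()
        self.encoder = nn.GRU(d, d, batch_first=True)  # encodes one response
        self.head = nn.Linear(2 * d, 1)                # response + speaker context

    def forward(self, response, other_responses):
        # response: (B, T, d); other_responses: list of (B, T_i, d) tensors
        _, h = self.encoder(response)                  # h: (1, B, d)
        # Pool the speaker's remaining responses into one context vector.
        ctx = torch.stack(
            [self.encoder(r)[1].squeeze(0) for r in other_responses]
        ).mean(dim=0)                                  # (B, d)
        return self.head(torch.cat([h.squeeze(0), ctx], dim=-1))  # (B, 1)

scorer = SpeakerConditionedScorer()
score = scorer(torch.randn(2, 50, 256), [torch.randn(2, 40, 256) for _ in range(3)])
print(score.shape)  # torch.Size([2, 1])
```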
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- NAAQA: A Neural Architecture for Acoustic Question Answering [8.364707318181193]
The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene.
We propose a new benchmark for AQA, namely CLEAR2, that emphasizes the specific challenges of acoustic inputs.
We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs.
arXiv Detail & Related papers (2021-06-11T03:05:48Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and gates which pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
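As an illustration of frame-selection gating in general, the sketch below scores each frame against the question and softly gates frame features before pooling. It is a minimal stand-in under assumed dimensions, not the paper's module.

```python
import torch
import torch.nn as nn

class FrameGate(nn.Module):
    """Question-conditioned soft gating over video frames (illustrative)."""

    def __init__(self, d=256):
        super().__init__()
        self.score = nn.Linear(2 * d, 1)  # scores a (frame, question) pair

    def forward(self, frames, question):
        # frames: (B, F, d); question: (B, d) pooled question embedding
        q = question.unsqueeze(1).expand(-1, frames.size(1), -1)
        gate = torch.sigmoid(self.score(torch.cat([frames, q], dim=-1)))  # (B, F, 1)
        # Weighted average keeps the temporally relevant frames.
        return (gate * frames).sum(dim=1) / gate.sum(dim=1).clamp_min(1e-6)

gate = FrameGate()
pooled = gate(torch.randn(2, 30, 256), torch.randn(2, 256))
print(pooled.shape)  # torch.Size([2, 256])
```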
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.