Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
- URL: http://arxiv.org/abs/2204.09634v1
- Date: Wed, 20 Apr 2022 17:28:53 GMT
- Title: Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering
- Authors: Samuel Lipping, Parthasaarathy Sudarsanam, Konstantinos Drossos,
Tuomas Virtanen
- Abstract summary: We introduce Clotho-AQA, a dataset for audio question answering consisting of 1991 audio files, each between 15 and 30 seconds in duration, selected from the Clotho dataset.
For each audio file, we collect six different questions and corresponding answers by crowdsourcing using Amazon Mechanical Turk.
We present two baseline experiments to demonstrate the use of our dataset for the AQA task.
- Score: 18.581514902689346
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Audio question answering (AQA) is a multimodal translation task where a
system analyzes an audio signal and a natural language question to generate a
desirable natural language answer. In this paper, we introduce Clotho-AQA, a
dataset for audio question answering consisting of 1991 audio files, each
between 15 and 30 seconds in duration, selected from the Clotho dataset [1]. For
each audio file, we collect six different questions and corresponding answers
by crowdsourcing using Amazon Mechanical Turk. The questions and answers are
produced by different annotators. Out of the six questions for each audio, two
are designed to have 'yes' as the answer and two to have 'no', while the
remaining two questions have other single-word answers. For each question, we
collect answers from three different annotators. We also present two baseline
experiments to demonstrate the use of our dataset for the AQA task: an
LSTM-based multimodal binary classifier for 'yes' or 'no' type answers and an
LSTM-based multimodal multi-class classifier for 828 single-word answers. The
binary classifier achieved an accuracy of 62.7% and the multi-class classifier
achieved a top-1 accuracy of 54.2% and a top-5 accuracy of 93.7%. The Clotho-AQA
dataset is freely available online at https://zenodo.org/record/6473207.
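The two baselines are described only at a high level above; the following is a minimal PyTorch sketch of what such an LSTM-based multimodal classifier could look like. The feature dimensions, hidden sizes, and late-fusion step are illustrative assumptions rather than the authors' exact configuration: the audio is assumed to arrive as a sequence of spectrogram-like frames and the question as a sequence of word embeddings.

```python
# Minimal sketch of an LSTM-based multimodal AQA classifier (illustrative only:
# feature sizes, hidden sizes, and the late-fusion step are assumptions,
# not the paper's exact baseline configuration).
import torch
import torch.nn as nn


class LSTMAQAClassifier(nn.Module):
    def __init__(self, audio_dim=64, text_dim=300, hidden_dim=128, num_answers=1):
        super().__init__()
        # One LSTM encodes the audio feature sequence (e.g. log-mel frames),
        # another encodes the question as a sequence of word embeddings.
        self.audio_lstm = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.text_lstm = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        # Concatenate the two final hidden states and map to the answer space:
        # num_answers=1 for the yes/no classifier, 828 for single-word answers.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, audio_feats, question_embs):
        # audio_feats:   (batch, n_frames, audio_dim)
        # question_embs: (batch, n_words, text_dim)
        _, (audio_h, _) = self.audio_lstm(audio_feats)
        _, (text_h, _) = self.text_lstm(question_embs)
        fused = torch.cat([audio_h[-1], text_h[-1]], dim=-1)
        return self.classifier(fused)  # raw logits; apply sigmoid/softmax outside


# Example forward pass with random tensors standing in for real features.
model = LSTMAQAClassifier(num_answers=1)
logits = model(torch.randn(4, 500, 64), torch.randn(4, 12, 300))
print(logits.shape)  # torch.Size([4, 1])
```

Setting num_answers=1 with a sigmoid on the logit corresponds to the binary yes/no baseline, while num_answers=828 with a softmax and cross-entropy loss corresponds to the single-word answer classifier, from which top-1 and top-5 accuracy can be read off the ranked logits.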
Related papers
- Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
- Attention-Based Methods For Audio Question Answering [16.82832919748399]
We propose neural network architectures based on self-attention and cross-attention for the audio question answering task.
All our models are trained on the recently proposed Clotho-AQA dataset for both binary yes/no questions and single-word answer questions (a minimal cross-attention sketch appears after this list).
arXiv Detail & Related papers (2023-05-31T12:00:51Z)
- Activity report analysis with automatic single or multispan answer extraction [0.21485350418225244]
We create a new smart home environment dataset comprised of questions paired with single-span or multi-span answers depending on the question and context queried.
Our experiments show that the proposed model outperforms state-of-the-art QA models on our dataset.
arXiv Detail & Related papers (2022-09-09T06:33:29Z)
- An Answer Verbalization Dataset for Conversational Question Answerings over Knowledge Graphs [9.979689965471428]
This paper contributes to the state-of-the-art by extending an existing ConvQA dataset with verbalized answers.
We perform experiments with five sequence-to-sequence models on generating answer responses while maintaining grammatical correctness.
arXiv Detail & Related papers (2022-08-13T21:21:28Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- In Situ Answer Sentence Selection at Web-scale [120.69820139008138]
Passage-based Extracting Answer Sentence In-place (PEASI) is a novel design for answer sentence selection (AS2) optimized for the Web-scale setting.
We train PEASI in a multi-task learning framework that encourages feature sharing between its components: a passage reranker and a passage-based answer sentence extractor.
Experiments show PEASI effectively outperforms the current state-of-the-art setting for AS2, i.e., a point-wise model for ranking sentences independently, by 6.51% in accuracy.
arXiv Detail & Related papers (2022-01-16T06:36:00Z)
- Zero-Shot Open-Book Question Answering [0.0]
This article proposes a solution for answering natural language questions from technical documents with no domain-specific labeled data (zero-shot).
We are introducing a new test dataset for open-book QA based on real customer questions on AWS technical documentation.
We were able to achieve 49% F1 and 39% exact match (EM) end-to-end with no domain-specific training.
arXiv Detail & Related papers (2021-11-22T20:38:41Z)
- QAConv: Question Answering on Informative Conversations [85.2923607672282]
We focus on informative conversations including business emails, panel discussions, and work channels.
In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions.
arXiv Detail & Related papers (2021-05-14T15:53:05Z)
- GooAQ: Open Question Answering with Diverse Answer Types [63.06454855313667]
We present GooAQ, a large-scale dataset with a variety of answer types.
This dataset contains over 5 million questions and 3 million answers collected from Google.
arXiv Detail & Related papers (2021-04-18T05:40:39Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)
- ParaQA: A Question Answering Dataset with Paraphrase Responses for Single-Turn Conversation [5.087932295628364]
ParaQA is a dataset with multiple paraphrased responses for single-turn conversation over knowledge graphs (KG).
The dataset was created using a semi-automated framework for generating diverse paraphrasing of the answers using techniques such as back-translation.
arXiv Detail & Related papers (2021-03-13T18:53:07Z)
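Relating to the Attention-Based Methods For Audio Question Answering entry above, the sketch below shows one plausible way to let question tokens attend over an audio frame sequence with torch.nn.MultiheadAttention. The dimensions, mean pooling, and single attention layer are illustrative assumptions, not that paper's actual architecture.

```python
# Minimal cross-attention sketch for audio question answering (illustrative only:
# dimensions, pooling, and the single attention layer are assumptions,
# not the cited paper's architecture).
import torch
import torch.nn as nn


class CrossAttentionAQA(nn.Module):
    def __init__(self, d_model=128, num_heads=4, num_answers=1):
        super().__init__()
        # Question tokens (queries) attend over the audio frames (keys/values).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, question_tokens, audio_frames):
        # question_tokens: (batch, n_words, d_model)
        # audio_frames:    (batch, n_frames, d_model)
        attended, _ = self.cross_attn(query=question_tokens,
                                      key=audio_frames,
                                      value=audio_frames)
        pooled = attended.mean(dim=1)   # average over question positions
        return self.classifier(pooled)  # logits for yes/no or single-word answers


model = CrossAttentionAQA()
logits = model(torch.randn(2, 12, 128), torch.randn(2, 500, 128))
print(logits.shape)  # torch.Size([2, 1])
```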
This list is automatically generated from the titles and abstracts of the papers on this site.