ASQA: Factoid Questions Meet Long-Form Answers
- URL: http://arxiv.org/abs/2204.06092v1
- Date: Tue, 12 Apr 2022 21:58:44 GMT
- Title: ASQA: Factoid Questions Meet Long-Form Answers
- Authors: Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, Ming-Wei Chang
- Abstract summary: This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary.
We use this notion of correctness to define an automated metric of performance for ASQA.
- Score: 35.11889930792675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An abundance of datasets and availability of reliable evaluation metrics have
resulted in strong progress in factoid question answering (QA). This progress,
however, does not easily transfer to the task of long-form QA, where the goal
is to answer questions that require in-depth explanations. The hurdles include
(i) a lack of high-quality data, and (ii) the absence of a well-defined notion
of the answer's quality. In this work, we address these problems by (i)
releasing a novel dataset and a task that we call ASQA (Answer Summaries for
Questions which are Ambiguous); and (ii) proposing a reliable metric for
measuring performance on ASQA. Our task focuses on factoid questions that are
ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from
multiple sources into a long-form summary that resolves the ambiguity. In
contrast to existing long-form QA tasks (such as ELI5), ASQA admits a clear
notion of correctness: a user faced with a good summary should be able to
answer different interpretations of the original ambiguous question. We use
this notion of correctness to define an automated metric of performance for
ASQA. Our analysis demonstrates an agreement between this metric and human
judgments, and reveals a considerable gap between human performance and strong
baselines.
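To make the correctness notion concrete, below is a minimal sketch of a QA-based check in the spirit of the metric described in the abstract. It is not the authors' official ASQA scorer: the extractive QA model, the token-level F1 scoring, and the function names are illustrative assumptions.

```python
# Minimal sketch of a QA-based correctness check in the spirit of the metric
# described in the abstract; NOT the authors' official ASQA scorer.
# Assumptions: an off-the-shelf extractive QA model stands in for the "user",
# and answers are scored with token-level F1 against reference short answers.
from collections import Counter

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference short answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def qa_correctness(summary: str, disambiguations: list[dict]) -> float:
    """Average QA score of the summary over all disambiguated questions.

    `disambiguations` holds one {"question": ..., "answer": ...} pair per
    interpretation of the original ambiguous question.
    """
    scores = []
    for item in disambiguations:
        predicted = qa(question=item["question"], context=summary)["answer"]
        scores.append(token_f1(predicted, item["answer"]))
    return sum(scores) / len(scores)


# Toy example: an ambiguous "when was the film released?" question with two
# interpretations that a good long-form answer should cover.
summary = ("The film premiered in the UK on April 26, 2012, "
           "and was released in the US on May 4, 2012.")
disambiguations = [
    {"question": "When was the film released in the UK?", "answer": "April 26, 2012"},
    {"question": "When was the film released in the US?", "answer": "May 4, 2012"},
]
print(qa_correctness(summary, disambiguations))
```

The design mirrors the stated intuition: the summary is scored by how well a machine reader can recover the short answer to each disambiguated question from the summary alone.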
Related papers
- Retrieving Contextual Information for Long-Form Question Answering using Weak Supervision [23.394961301584026]
Long-form question answering (LFQA) aims at generating in-depth answers to end-user questions.
We propose and compare different weak supervision techniques to optimize retrieval for contextual information.
We show that long-form answers often anticipate likely follow-up questions.
arXiv Detail & Related papers (2024-10-11T08:42:02Z)
- PEDANTS: Cheap but Effective and Interpretable Answer Equivalence [10.367359022491181]
We provide rubrics and datasets for evaluating machine QA adopted from the Trivia community.
We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
arXiv Detail & Related papers (2024-02-17T01:56:19Z)
- SQuArE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer? [43.03399918557937]
In real-world applications, users often ask questions that don't have a definitive answer.
We introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers.
With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions.
We show that even SOTA models, including GPT-3 and Flan T5, do not fare well on these tasks and lag considerably behind the human performance baseline.
arXiv Detail & Related papers (2023-09-08T23:12:03Z)
- Answering Ambiguous Questions with a Database of Questions, Answers, and Revisions [95.92276099234344]
We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia.
Our method improves performance by 15% on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs.
arXiv Detail & Related papers (2023-08-16T20:23:16Z)
- CREPE: Open-Domain Question Answering with False Presuppositions [92.20501870319765]
We introduce CREPE, a QA dataset containing a natural distribution of presupposition failures from online information-seeking forums.
We find that 25% of questions contain false presuppositions, and provide annotations for these presuppositions and their corrections.
We show that adaptations of existing open-domain QA models can find presuppositions moderately well, but struggle when predicting whether a presupposition is factually correct.
arXiv Detail & Related papers (2022-11-30T18:54:49Z)
- GooAQ: Open Question Answering with Diverse Answer Types [63.06454855313667]
We present GooAQ, a large-scale dataset with a variety of answer types.
This dataset contains over 5 million questions and 3 million answers collected from Google.
arXiv Detail & Related papers (2021-04-18T05:40:39Z)
- QED: A Framework and Dataset for Explanations in Question Answering [27.85923397716627]
We release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset.
A promising result suggests that training on a relatively small amount of QED data can improve question answering.
arXiv Detail & Related papers (2020-09-08T23:34:18Z)
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries [80.65186293015135]
We propose an automatic evaluation protocol called QAGS (pronounced "kags") to identify factual inconsistencies in a generated summary.
QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source.
We believe QAGS is a promising tool for automatically generating usable and factually consistent text (a minimal sketch of this question-asking protocol appears after this list).
arXiv Detail & Related papers (2020-04-08T20:01:09Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT)
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)
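The question-asking idea behind QAGS (see the entry above) can be sketched in a few lines. This is not the original QAGS pipeline: QAGS generates its probe questions automatically, whereas this sketch assumes the questions are supplied by the caller, and the QA model and token-F1 comparison are illustrative assumptions.

```python
# Minimal sketch of a QAGS-style consistency check (assumptions throughout,
# not the original QAGS pipeline): answer the same questions against the
# summary and against its source, then measure how much the answers agree.
from collections import Counter

from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")


def answer_f1(a: str, b: str) -> float:
    """Token-level F1 between two short answers."""
    ta, tb = a.lower().split(), b.lower().split()
    overlap = sum((Counter(ta) & Counter(tb)).values())
    return 2 * overlap / (len(ta) + len(tb)) if overlap else 0.0


def consistency_score(summary: str, source: str, questions: list[str]) -> float:
    """Average agreement between answers read off the summary and the source.

    High agreement suggests the summary is factually consistent with its
    source; low agreement flags likely hallucinated or distorted content.
    """
    scores = []
    for question in questions:
        ans_from_summary = qa(question=question, context=summary)["answer"]
        ans_from_source = qa(question=question, context=source)["answer"]
        scores.append(answer_f1(ans_from_summary, ans_from_source))
    return sum(scores) / len(scores)
```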
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.