Hurdles to Progress in Long-form Question Answering
- URL: http://arxiv.org/abs/2103.06332v1
- Date: Wed, 10 Mar 2021 20:32:30 GMT
- Title: Hurdles to Progress in Long-form Question Answering
- Authors: Kalpesh Krishna, Aurko Roy, Mohit Iyyer
- Abstract summary: We show that the task formulation raises fundamental challenges regarding evaluation and dataset creation.
We first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of long-form question answering (LFQA) involves retrieving documents
relevant to a given question and using them to generate a paragraph-length
answer. While many models have recently been proposed for LFQA, we show in this
paper that the task formulation raises fundamental challenges regarding
evaluation and dataset creation that currently preclude meaningful modeling
progress. To demonstrate these challenges, we first design a new system that
relies on sparse attention and contrastive retriever learning to achieve
state-of-the-art performance on the ELI5 LFQA dataset. While our system tops
the public leaderboard, a detailed analysis reveals several troubling trends:
(1) our system's generated answers are not actually grounded in the documents
that it retrieves; (2) ELI5 contains significant train / test overlap, as at
least 81% of ELI5 validation questions occur in paraphrased form in the
training set; (3) ROUGE-L is not an informative metric of generated answer
quality and can be easily gamed; and (4) human evaluations used for other text
generation tasks are unreliable for LFQA. We provide suggestions to mitigate
each of these issues, which we hope will lead to more rigorous LFQA research
and meaningful progress in the future.
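Finding (3) above is easy to probe in miniature. The following sketch is illustrative only, not the authors' code: it uses the `rouge-score` Python package to score two degenerate "answers" against a reference answer, namely a verbatim copy of the question and an answer lifted from an unrelated question. The toy strings are invented for this example; the paper reports that comparable trivial baselines achieve competitive ROUGE-L on the real ELI5 data.

```python
# Illustrative sketch (invented toy data, not the authors' code).
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

question = "Why does the sky look blue during the day?"
reference = (
    "The sky looks blue because molecules in the atmosphere scatter "
    "shorter wavelengths of sunlight much more strongly than longer "
    "wavelengths, so scattered blue light reaches your eyes from every "
    "direction."
)

# Two degenerate predictions that never consult any retrieved documents.
candidates = {
    "copy of the question": question,
    "unrelated training answer": (
        "It is mostly because of the way the molecules in the air "
        "interact with the light, and the effect gets much stronger "
        "the more of them there are around you."
    ),
}

for name, prediction in candidates.items():
    f1 = scorer.score(reference, prediction)["rougeL"].fmeasure
    print(f"{name:>25}: ROUGE-L F1 = {f1:.3f}")
```

Because ROUGE-L measures only longest-common-subsequence overlap, any fluent answer that shares common phrasing with the reference picks up credit, regardless of whether it engages with the question or with the retrieved documents.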
Related papers
- Localizing and Mitigating Errors in Long-form Question Answering (2024-07-16)
  Long-form question answering (LFQA) aims to provide thorough and in-depth answers to complex questions, enhancing comprehension.
  This work introduces HaluQuestQA, the first hallucination dataset with localized error annotations for human-written and model-generated LFQA answers.
- Long-form Question Answering: An Iterative Planning-Retrieval-Generation Approach (2023-11-15)
  Long-form question answering (LFQA) is challenging because it involves generating detailed, paragraph-length answers.
  We propose an LFQA model with iterative Planning, Retrieval, and Generation.
  We find that our model outperforms state-of-the-art models on various textual and factual metrics for the LFQA task.
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models (2023-10-17)
  This paper contributes a comprehensive dataset, called UNK-VQA.
  We first augment the existing data via deliberate perturbations of either the image or the question.
  We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
- QASnowball: An Iterative Bootstrapping Framework for High-Quality Question-Answering Data Generation (2023-09-19)
  We introduce an iterative bootstrapping framework for QA data augmentation, named QASnowball.
  QASnowball can iteratively generate large-scale, high-quality QA data from a seed set of supervised examples.
  Experiments in a high-resource English setting and a medium-resource Chinese setting show that the data generated by QASnowball can improve QA models.
- Generative Long-form Question Answering: Relevance, Faithfulness and Succinctness (2022-11-15)
  Long-form question answering (LFQA) aims to generate an in-depth, paragraph-length answer for a given question.
  Little prior work has addressed building an effective LFQA system.
  We pioneer research on improving answer quality along three axes: 1) query relevance, 2) answer faithfulness, and 3) answer succinctness.
- RealTime QA: What's the Answer Right Now? (2022-07-27)
  We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis.
  We build strong baseline models upon large pretrained language models, including GPT-3 and T5.
  GPT-3 tends to return outdated answers when the retrieved documents do not provide sufficient information to find an answer.
- Read before Generate! Faithful Long Form Question Answering with Machine Reading (2022-03-01)
  Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question.
  We propose a new end-to-end framework that jointly models answer generation and machine reading.
- New Methods & Metrics for LFQA tasks (2021-12-26)
  Long-form question answering tasks require retrieving the documents pertinent to a query and using them to form a paragraph-length answer.
  This work addresses train/validation/test dataset overlap, the absence of automatic metrics, and generated answers not being "grounded" in the retrieved documents.
- Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards (2021-07-01)
  We introduce a reinforcement learning-based framework for abstractive question summarization.
  We propose two novel rewards obtained from the downstream tasks of (i) question-type identification and (ii) question-focus recognition.
  These rewards ensure the generation of semantically valid questions and encourage the inclusion of key medical entities/foci in the question summary.
- Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval (2020-10-22)
  We analyze why information-seeking queries are more challenging to answer and where their prevalent unanswerable cases arise.
  Our controlled experiments suggest two areas of headroom: paragraph selection and answerability prediction.
  We manually annotate 800 unanswerable examples across six languages to determine what makes them challenging to answer.
- Inquisitive Question Generation for High Level Text Comprehension (2020-10-04)
  We introduce INQUISITIVE, a dataset of 19K questions elicited while a person reads through a document.
  We show that readers engage in a series of pragmatic strategies to seek information.
  We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.