RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
- URL: http://arxiv.org/abs/2210.14353v1
- Date: Tue, 25 Oct 2022 21:39:36 GMT
- Title: RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
- Authors: Victor Zhong, Weijia Shi, Wen-tau Yih, Luke Zettlemoyer
- Abstract summary: We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is challenging across all of these settings.
RoMQA therefore provides a quantifiable test for building more robust QA methods.
- Score: 87.18962441714976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce RoMQA, the first benchmark for robust, multi-evidence,
multi-answer question answering (QA). RoMQA contains clusters of questions that
are derived from related constraints mined from the Wikidata knowledge graph.
RoMQA evaluates robustness of QA models to varying constraints by measuring
worst-case performance within each question cluster. Compared to prior QA
datasets, RoMQA has more human-written questions that require reasoning over
more evidence text and have, on average, many more correct answers. In
addition, human annotators rate RoMQA questions as more natural or likely to be
asked by people. We evaluate state-of-the-art large language models in
zero-shot, few-shot, and fine-tuning settings, and find that RoMQA is
challenging: zero-shot and few-shot models perform similarly to naive
baselines, while supervised retrieval methods perform well below gold evidence
upper bounds. Moreover, existing models are not robust to variations in
question constraints, but can be made more robust by tuning on clusters of
related questions. Our results show that RoMQA is a challenging benchmark for
large language models, and provides a quantifiable test to build more robust QA
methods.
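As a rough illustration of the cluster-wise worst-case evaluation described above, the sketch below aggregates per-question scores (e.g., answer-set F1) by taking the minimum within each cluster of constraint-related questions and then averaging those minima. The function name and data layout are illustrative assumptions, not RoMQA's official evaluation code.
```python
from collections import defaultdict

def worst_case_cluster_score(question_scores, question_clusters):
    """Aggregate per-question scores into a robustness score: take the
    minimum (worst case) within each cluster of related questions, then
    average those minima across clusters.

    question_scores:   dict of question id -> score (e.g., answer-set F1)
    question_clusters: dict of question id -> cluster id
    """
    per_cluster = defaultdict(list)
    for qid, score in question_scores.items():
        per_cluster[question_clusters[qid]].append(score)
    cluster_minima = [min(scores) for scores in per_cluster.values()]
    return sum(cluster_minima) / len(cluster_minima)

# Toy example: two clusters, each containing constraint variations of a question.
scores = {"q1": 0.9, "q2": 0.4, "q3": 0.8, "q4": 0.7}
clusters = {"q1": "c1", "q2": "c1", "q3": "c2", "q4": "c2"}
print(worst_case_cluster_score(scores, clusters))  # (0.4 + 0.7) / 2 = 0.55
```
Averaging the per-question metric would hide the weak q2 here; taking the per-cluster minimum is what exposes a model's sensitivity to variations in question constraints.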
Related papers
- Diversity Enhanced Narrative Question Generation for Storybooks [4.043005183192124]
We introduce a multi-question generation model (mQG) capable of generating multiple, diverse, and answerable questions.
To validate the answerability of the generated questions, we employ a SQuAD2.0 fine-tuned question answering model (a sketch of this kind of check appears after this list).
mQG shows promising results across various evaluation metrics against strong baselines.
arXiv Detail & Related papers (2023-10-25T08:10:04Z) - MarkQA: A large scale KBQA dataset with numerical reasoning [11.072552105311484]
We propose a new task, NR-KBQA, which requires the ability to perform both multi-hop reasoning and numerical reasoning.
We design a logic form in Python format called PyQL to represent the reasoning process of numerical reasoning questions.
We present a large dataset called MarkQA, which is automatically constructed from a small set of seeds.
arXiv Detail & Related papers (2023-10-24T04:50:59Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - An Empirical Comparison of LM-based Question and Answer Generation
Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z) - Toward Unsupervised Realistic Visual Question Answering [70.67698100148414]
We study the problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs).
We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training.
We propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs.
For training, pseudo UQs are obtained by randomly pairing images and questions.
arXiv Detail & Related papers (2023-03-09T06:58:29Z) - How to Build Robust FAQ Chatbot with Controllable Question Generator? [5.680871239968297]
We propose a high-quality, diverse, controllable method to generate adversarial samples with a semantic graph.
The generated QA pairs are fluent and semantically coherent, and they successfully fool our passage retrieval model.
We find that the generated data set improves the generalizability of the QA model to the new target domain.
arXiv Detail & Related papers (2021-11-18T12:54:07Z) - Generating Diverse and Consistent QA pairs from Contexts with
Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z) - ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)