QuALITY: Question Answering with Long Input Texts, Yes!
- URL: http://arxiv.org/abs/2112.08608v1
- Date: Thu, 16 Dec 2021 04:14:38 GMT
- Title: QuALITY: Question Answering with Long Input Texts, Yes!
- Authors: Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman
- Abstract summary: We introduce QuALITY, a dataset with context passages in English that have an average length of about 5,000 tokens.
Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage.
Only half of the questions are answerable by annotators working under tight time constraints.
- Score: 27.700792723226524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To enable building and testing models on long-document comprehension, we
introduce QuALITY, a multiple-choice QA dataset with context passages in
English that have an average length of about 5,000 tokens, much longer than
typical current models can process. Unlike in prior work with passages, our
questions are written and validated by contributors who have read the entire
passage, rather than relying on summaries or excerpts. In addition, only half
of the questions are answerable by annotators working under tight time
constraints, indicating that skimming and simple search are not enough to
consistently perform well. Current models perform poorly on this task (55.4%)
and significantly lag behind human performance (93.5%).
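As a rough illustration of the task format, the sketch below scores accuracy on multiple-choice QA records of this kind; the field names (article, question, options, gold_label) and the toy baseline are assumptions for illustration, not the official QuALITY schema or evaluation code.
```python
# Minimal sketch: accuracy on multiple-choice long-document QA.
# Field names (article, question, options, gold_label) are illustrative
# assumptions, not the official QuALITY schema.
from typing import Callable, Dict, List

def evaluate_accuracy(
    records: List[Dict],
    choose_option: Callable[[str, str, List[str]], int],
) -> float:
    """Return the fraction of questions answered correctly.

    choose_option(article, question, options) -> index of the chosen option.
    """
    correct = 0
    for rec in records:
        pred = choose_option(rec["article"], rec["question"], rec["options"])
        correct += int(pred == rec["gold_label"])
    return correct / max(len(records), 1)

if __name__ == "__main__":
    # Toy record with a trivially short "article"; real passages average ~5,000 tokens.
    toy = [{
        "article": "The ship departed at dawn and reached port after nine days.",
        "question": "How long was the voyage?",
        "options": ["Two days", "Nine days", "A month", "It never arrived"],
        "gold_label": 1,
    }]
    # Baseline that always picks the first option.
    print(evaluate_accuracy(toy, lambda a, q, opts: 0))  # 0.0
```
Any model that maps a passage, question, and option list to a chosen index can be plugged in as choose_option.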
Related papers
- BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack [4.3482088816575155]
We introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in long documents.
BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets.
Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity.
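As a rough sketch of the general reasoning-in-a-haystack setup described above, the snippet below scatters a few task-relevant facts through long filler text; the filler sentences and fact format are made up for illustration and are not the BABILong construction procedure.
```python
# Rough sketch of a "reasoning-in-a-haystack" input: task facts are
# scattered through long distractor text. Filler source and fact format
# are illustrative, not the BABILong construction procedure.
import random

def build_haystack(facts, filler_sentences, target_len_sentences, seed=0):
    """Interleave the given facts at random positions inside filler text."""
    rng = random.Random(seed)
    haystack = [rng.choice(filler_sentences) for _ in range(target_len_sentences)]
    positions = sorted(rng.sample(range(len(haystack)), len(facts)))
    for pos, fact in zip(positions, facts):
        haystack[pos] = fact
    return " ".join(haystack)

facts = ["Mary moved to the kitchen.", "Mary picked up the apple."]
filler = ["The weather stayed mild all week.", "Nothing else of note happened."]
context = build_haystack(facts, filler, target_len_sentences=50)
question = "Where is the apple?"  # answering requires chaining both facts
print(len(context.split()), "words of context")
```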
arXiv Detail & Related papers (2024-06-14T16:00:29Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
NovelQA is a benchmark designed to test the capabilities of Large Language Models with extended texts.
This paper presents the design and construction of NovelQA, highlighting its manual annotation and diverse question types.
Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance.
arXiv Detail & Related papers (2024-03-18T17:32:32Z) - Training With "Paraphrasing the Original Text" Improves Long-Context Performance [19.48556587305737]
As Large Language Models (LLMs) continue to evolve, more are being designed to handle long-context inputs.
We propose a novel approach to design training data for long-context tasks, aiming at augmenting LLMs' proficiency in extracting key information from long context.
Experimenting on the LongBench and NaturalQuestions multi-document QA datasets with models from the Llama and Qwen series, our method achieves improvements of up to 8.48% and 4.48% in average scores.
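A minimal sketch of what such a training example could look like, assuming the general idea is that the target output restates the relevant source span before answering; the template wording is an assumption, not the paper's exact data format.
```python
# Hedged sketch of assembling a long-context training example whose target
# output first restates (paraphrases) the relevant source span before giving
# the answer. The template is an assumption about the general idea, not the
# paper's exact data format.
def build_training_example(documents, question, relevant_span, answer):
    prompt = (
        "\n\n".join(documents)
        + f"\n\nQuestion: {question}\nAnswer:"
    )
    target = (
        f"The original text states: \"{relevant_span}\" "
        f"Therefore, the answer is {answer}."
    )
    return {"prompt": prompt, "target": target}

example = build_training_example(
    documents=["(long distractor document)", "The bridge opened in 1937."],
    question="When did the bridge open?",
    relevant_span="The bridge opened in 1937.",
    answer="1937",
)
print(example["target"])
```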
arXiv Detail & Related papers (2023-12-18T13:40:16Z) - Harnessing the Power of Prompt-based Techniques for Generating
School-Level Questions using Large Language Models [0.5459032912385802]
We propose a novel approach that utilizes prompt-based techniques to generate descriptive and reasoning-based questions.
We curate a new QG dataset called EduProbe for school-level subjects by leveraging the rich content of NCERT textbooks.
We investigate several prompt-based QG methods by fine-tuning transformer-based large language models.
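A minimal sketch of a prompt template for this kind of school-level question generation; the wording, fields, and example passage are illustrative assumptions, not the EduProbe prompts.
```python
# Sketch of a prompt template for school-level question generation from a
# textbook passage. The wording and fields are illustrative assumptions,
# not the EduProbe prompts used in the paper.
QG_TEMPLATE = (
    "You are preparing exam questions for a school {subject} class.\n"
    "Passage:\n{passage}\n\n"
    "Write one {question_type} question that can be answered from the passage."
)

def make_qg_prompt(passage: str, subject: str, question_type: str) -> str:
    return QG_TEMPLATE.format(
        subject=subject, passage=passage, question_type=question_type
    )

print(make_qg_prompt(
    passage="Photosynthesis converts light energy into chemical energy.",
    subject="biology",
    question_type="reasoning-based",
))
```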
arXiv Detail & Related papers (2023-12-02T05:13:28Z) - BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved dramatic proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z) - LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding [58.20031627237889]
LongBench is the first bilingual, multi-task benchmark for long context understanding.
It comprises 21 datasets across 6 task categories in both English and Chinese, with an average length of 6,711 words (English) and 13,386 characters (Chinese).
arXiv Detail & Related papers (2023-08-28T11:53:40Z) - NarrativeXL: A Large-scale Dataset For Long-Term Memory Models [0.0]
Using GPT 3.5, we summarized each scene in 1,500 hand-curated fiction books from Project Gutenberg.
With 990,595 total questions, our dataset is an order of magnitude larger than the closest alternatives.
Most questions have a known "retention demand", indicating how long-term a memory is needed to answer them.
arXiv Detail & Related papers (2023-05-23T09:55:32Z) - IIRC: A Dataset of Incomplete Information Reading Comprehension
Questions [53.3193258414806]
We present a dataset, IIRC, with more than 13K questions over paragraphs from English Wikipedia.
The questions were written by crowd workers who did not have access to any of the linked documents.
We follow recent modeling work on various reading comprehension datasets to construct a baseline model for this dataset.
arXiv Detail & Related papers (2020-11-13T20:59:21Z) - TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions [91.85730323228833]
We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships.
Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.
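As a rough illustration of exact-match scoring when each question expects a set of answers (as in temporal-relation questions of this kind), the sketch below compares predicted and gold answer sets; the data layout is an assumption, not the official TORQUE evaluation script.
```python
# Sketch of exact-match scoring when each question's answer is a set of
# items (e.g., sets of events); the data layout is an illustrative
# assumption, not the official evaluation script.
from typing import Dict, Set

def exact_match(predictions: Dict[str, Set[str]], gold: Dict[str, Set[str]]) -> float:
    """Fraction of questions whose predicted answer set equals the gold set."""
    hits = sum(predictions.get(qid, set()) == answers for qid, answers in gold.items())
    return hits / max(len(gold), 1)

gold = {"q1": {"signed", "announced"}, "q2": {"left"}}
pred = {"q1": {"signed", "announced"}, "q2": {"arrived"}}
print(exact_match(pred, gold))  # 0.5
```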
arXiv Detail & Related papers (2020-05-01T06:29:56Z)