Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
- URL: http://arxiv.org/abs/2406.00179v1
- Date: Fri, 31 May 2024 20:15:10 GMT
- Title: Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation
- Authors: Bernd Bohnet, Kevin Swersky, Rosanne Liu, Pranjal Awasthi, Azade Nova, Javier Snaider, Hanie Sedghi, Aaron T Parisi, Michael Collins, Angeliki Lazaridou, Orhan Firat, Noah Fiedel
- Abstract summary: We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
- Score: 65.16137964758612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books. Previous efforts to construct such datasets relied on crowd-sourcing, but the emergence of transformers with a context size of 1 million or more tokens now enables entirely automatic approaches. Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text, such as questions involving character arcs, broader themes, or the consequences of early actions later in the story. We propose a holistic pipeline for automatic data generation including question generation, answering, and model scoring using an "Evaluator". We find that a relative approach, comparing answers between models in a pairwise fashion and ranking with a Bradley-Terry model, provides a more consistent and differentiating scoring mechanism than an absolute scorer that rates answers individually. We also show that LLMs from different model families produce moderate agreement in their ratings. We ground our approach using the manually curated NarrativeQA dataset, where our evaluator shows excellent agreement with human judgement and even finds errors in the dataset. Using our automatic evaluation approach, we show that using an entire book as context produces superior reading comprehension performance compared to baseline no-context (parametric knowledge only) and retrieval-based approaches.
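The pairwise ranking step described in the abstract can be made concrete with a small sketch. The snippet below fits a Bradley-Terry model to hypothetical win counts from side-by-side "Evaluator" judgments and ranks the compared QA systems by their fitted strengths; the model names and counts are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical pairwise preferences from an LLM "Evaluator":
# wins[i][j] = number of times system i's answer was preferred over system j's.
models = ["system_A", "system_B", "system_C"]
wins = np.array([
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
], dtype=float)

def bradley_terry(wins, iters=100, tol=1e-8):
    """Fit Bradley-Terry strengths with the standard MM iteration."""
    n = wins.shape[0]
    games = wins + wins.T                # total comparisons per pair
    total_wins = wins.sum(axis=1)
    p = np.ones(n)                       # initial strengths
    for _ in range(iters):
        denom = np.array([
            sum(games[i, j] / (p[i] + p[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        new_p = total_wins / denom
        new_p /= new_p.sum()             # normalize for identifiability
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

strengths = bradley_terry(wins)
for name, s in sorted(zip(models, strengths), key=lambda x: -x[1]):
    print(f"{name}: {s:.3f}")
```

Ranking systems by these relative strengths is the kind of comparison the abstract reports to be more consistent and more discriminative than absolute per-answer scoring.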
Related papers
- RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries.
This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity [0.0]
AugCon is capable of automatically generating context-driven SFT data across multiple levels of granularity with high diversity, quality and fidelity.
We train a scorer through contrastive learning to collaborate with CST to rank and refine queries.
The results highlight the significant advantages of AugCon in producing high diversity, quality, and fidelity SFT data against several state-of-the-art methods.
arXiv Detail & Related papers (2024-05-26T14:14:18Z)
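AugCon's scorer, mentioned in the summary above, is described only at a high level here. As a loose illustration, the sketch below trains a generic query scorer with a pairwise margin-based (contrastive-style) ranking objective; the network, embedding size, and preference pairs are assumptions for illustration, not AugCon's actual components.

```python
import torch
import torch.nn as nn

# Hypothetical scorer: maps a query embedding to a scalar quality score.
class QueryScorer(nn.Module):
    def __init__(self, embed_dim=768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, query_emb):
        return self.head(query_emb).squeeze(-1)

scorer = QueryScorer()
loss_fn = nn.MarginRankingLoss(margin=0.5)

# Assumed training signal: (preferred, rejected) query embeddings; random tensors
# stand in for the outputs of whatever encoder produces the query representations.
pos_emb = torch.randn(8, 768)   # higher-quality queries
neg_emb = torch.randn(8, 768)   # lower-quality queries
target = torch.ones(8)          # +1 means the first input should score higher

loss = loss_fn(scorer(pos_emb), scorer(neg_emb), target)
loss.backward()
print(f"ranking loss: {loss.item():.3f}")
```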
- Assisting humans in complex comparisons: automated information comparison at scale [0.0]
An Abstractive Summarization & Criteria-driven Comparison (ASC2End) system is developed.
Our system employs Semantic Text Similarity comparisons for generating evidence-supported analyses.
Prompts were designed using zero-shot strategies to contextualize information for improved model reasoning.
arXiv Detail & Related papers (2024-04-05T18:44:54Z)
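The Semantic Text Similarity comparison used by ASC2End (summarized above) can be approximated with off-the-shelf sentence embeddings. In the sketch below the embedding model, example texts, and threshold are illustrative assumptions rather than details from that paper.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

criterion = "The system must log every failed login attempt."
evidence = [
    "Unsuccessful authentication events are written to the audit log.",
    "The dashboard shows CPU usage over the last hour.",
]

crit_emb = model.encode(criterion, convert_to_tensor=True)
ev_emb = model.encode(evidence, convert_to_tensor=True)

# Cosine similarity between the criterion and each candidate evidence sentence.
scores = util.cos_sim(crit_emb, ev_emb)[0]
for text, score in zip(evidence, scores.tolist()):
    flag = "supporting" if score > 0.5 else "unrelated"   # assumed threshold
    print(f"{score:.2f} [{flag}] {text}")
```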
- Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model [2.14986347364539]
We developed an automated short answer grading model that provided both analytic scores and holistic scores.
The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71 on a subset of the publicly available ASAG dataset.
arXiv Detail & Related papers (2023-05-29T22:05:29Z)
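Quadratic weighted kappa, the agreement figure of 0.71 reported in the short-answer-grading entry above, weights disagreements by their squared distance on the grading scale. The toy example below computes it with scikit-learn on made-up grades.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up human vs. model scores on a 0-3 grading scale.
human_scores = [0, 1, 2, 3, 2, 1, 0, 3, 2, 2]
model_scores = [0, 1, 2, 2, 2, 1, 1, 3, 3, 2]

# weights="quadratic" penalizes large disagreements more than adjacent ones.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```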
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Read before Generate! Faithful Long Form Question Answering with Machine Reading [77.17898499652306]
Long-form question answering (LFQA) aims to generate a paragraph-length answer for a given question.
We propose a new end-to-end framework that jointly models answer generation and machine reading.
arXiv Detail & Related papers (2022-03-01T10:41:17Z)
- Meeting Summarization with Pre-training and Clustering Methods [6.47783315109491]
HMNet, a hierarchical network that employs both a word-level transformer and a turn-level transformer, serves as the baseline.
We extend the locate-then-summarize approach of QMSum with an intermediate clustering step.
We compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization.
arXiv Detail & Related papers (2021-11-16T03:14:40Z)
- $Q^{2}$: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering [38.951535576102906]
We propose an automatic evaluation metric for factual consistency in knowledge-grounded dialogue models.
Our metric makes use of co-reference resolution and natural language inference capabilities.
We curate a novel dataset of state-of-the-art dialogue system outputs for the Wizard-of-Wikipedia dataset.
arXiv Detail & Related papers (2021-04-16T16:21:16Z)
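The $Q^{2}$ metric summarized above asks questions about a system response and checks whether the grounding knowledge yields the same answers, using NLI for robust answer matching. The sketch below is a simplified stand-in built from off-the-shelf components, with hard-coded questions and exact-match comparison instead of the metric's actual question generation and NLI steps.

```python
from transformers import pipeline

# Illustrative model choice; the actual metric specifies its own QG/QA/NLI components.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

knowledge = "The Eiffel Tower was completed in 1889 and stands in Paris."
response = "The Eiffel Tower, finished in 1889, is located in Paris."

# In the full metric the questions are generated automatically from the response;
# they are hard-coded here to keep the sketch short.
questions = ["When was the Eiffel Tower completed?", "Where is the Eiffel Tower?"]

consistent = 0
for q in questions:
    ans_from_response = qa(question=q, context=response)["answer"]
    ans_from_knowledge = qa(question=q, context=knowledge)["answer"]
    # Q^2 compares these answers with NLI; exact string match is a crude stand-in.
    consistent += int(ans_from_response.strip().lower() == ans_from_knowledge.strip().lower())

print(f"Factual consistency (toy score): {consistent / len(questions):.2f}")
```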
- Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.