LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
- URL: http://arxiv.org/abs/2510.13494v1
- Date: Wed, 15 Oct 2025 12:43:59 GMT
- Title: LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA
- Authors: Tommaso Bonomo, Luca Gioffré, Roberto Navigli
- Abstract summary: We introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. We identify and correct low-quality QA samples while removing extraneous text from source documents. We benchmark a set of long-context LLMs on LiteraryQA.
- Score: 35.323445529050275
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/SapienzaNLP/LiteraryQA.
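The meta-evaluation described in the abstract comes down to comparing how an automatic metric and human annotators rank the same set of QA systems. Below is a minimal sketch of such a system-level comparison, assuming per-sample metric scores (e.g. ROUGE-L or an LLM-as-a-Judge correctness score) and human judgments are already available; the system names and numbers are purely illustrative and not taken from the LiteraryQA data or codebase.

```python
# Minimal sketch of a system-level meta-evaluation (illustrative data only).
from scipy.stats import kendalltau, pearsonr

# Hypothetical per-sample scores, one list per QA system.
metric_scores = {
    "system_a": [0.42, 0.55, 0.38, 0.61],
    "system_b": [0.71, 0.64, 0.69, 0.58],
    "system_c": [0.30, 0.25, 0.41, 0.33],
}
human_scores = {
    "system_a": [0.50, 0.75, 0.25, 0.75],
    "system_b": [1.00, 0.75, 0.75, 0.50],
    "system_c": [0.25, 0.25, 0.50, 0.25],
}

# Aggregate to one score per system, then correlate the two system-level lists.
systems = sorted(metric_scores)
metric_means = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
human_means = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]

pearson_r, _ = pearsonr(metric_means, human_means)      # linear agreement
kendall_tau, _ = kendalltau(metric_means, human_means)  # ranking agreement

print(f"System-level Pearson r:   {pearson_r:.3f}")
print(f"System-level Kendall tau: {kendall_tau:.3f}")
```

A low ranking correlation here is what disqualifies a metric at the system level, even if its per-sample scores look plausible in isolation.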
Related papers
- AURA Score: A Metric For Holistic Audio Question Answering Evaluation [57.042210272137396]
We introduce AQEval to enable systematic benchmarking of AQA metrics; it is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. We then conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting their weak correlation with human judgment. Finally, we propose a new metric, the AURA score, to better evaluate open-ended model responses.
arXiv Detail & Related papers (2025-10-06T15:41:34Z) - Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts [19.640586886024952]
HAMLET is a framework for evaluating the long-context comprehension of large language models. It structures texts into a three-level key-fact hierarchy at the root, branch, and leaf levels. It employs query-focused summarization to evaluate how well models recall and faithfully represent information at each level.
arXiv Detail & Related papers (2025-08-27T05:23:22Z) - Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z) - SQUARE: Automatic Question Answering Evaluation using Multiple Positive
and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation)
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z) - FEQA: A Question Answering Evaluation Framework for Faithfulness
Assessment in Abstractive Summarization [34.2456005415483]
We tackle the problem of evaluating faithfulness of a generated summary given its source document.
We find that current models exhibit a trade-off between abstractiveness and faithfulness.
We propose an automatic question answering (QA) based metric for faithfulness.
arXiv Detail & Related papers (2020-05-07T21:00:08Z)