PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
- URL: http://arxiv.org/abs/2508.09848v2
- Date: Thu, 14 Aug 2025 02:08:15 GMT
- Title: PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts
- Authors: Mo Yu, Tsz Ting Chung, Chulun Zhou, Tong Li, Rui Lu, Jiangnan Li, Liyan Xu, Haoshu Lu, Ning Zhang, Jing Li, Jie Zhou
- Abstract summary: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, lag behind humans by >15%.
- Score: 50.77454873238667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce PRELUDE, a benchmark for evaluating long-context understanding through the task of determining whether a character's prequel story is consistent with the canonical narrative of the original book. Our task poses a stronger demand for global comprehension and deep reasoning than existing benchmarks -- as the prequels are not part of the original story, assessing their plausibility typically requires searching and integrating information that is only indirectly related. Empirically, 88% of instances require evidence from multiple parts of the narrative. Experimental results highlight the challenge of our task: in-context learning, RAG, and in-domain training with state-of-the-art LLMs, as well as commercial DeepResearch services, lag behind humans by >15%. A further human study reveals that models often produce correct answers with flawed reasoning, leading to an over 30% gap in reasoning accuracy compared to humans. These findings underscore the substantial room for improvement in long-context understanding and reasoning.
Related papers
- Boule or Baguette? A Study on Task Topology, Length Generalization, and the Benefit of Reasoning Traces [31.37944377681284]
We use PITA, a dataset of over 23 million statements in propositional logic and their corresponding proofs. We find that RT models generalize well on broad and shallow subsets, while deteriorating on narrow and deep subsets relative to non-RT baselines. Our resulting theory suggests fundamental scalings that limit how well RT models perform on deep tasks, and highlights their generalization strengths on broad tasks.
arXiv Detail & Related papers (2026-02-16T02:20:37Z) - MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z) - Can Large Language Models Infer Causal Relationships from Real-World Text? [2.602939322715435]
This paper investigates whether large language models (LLMs) are capable of inferring causal relationships from real-world texts. To the best of our knowledge, our benchmark is the first-ever real-world dataset for this task.
arXiv Detail & Related papers (2025-05-25T01:50:05Z) - Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection [35.550137361809405]
Plot hole detection in stories is a proxy for evaluating language understanding and reasoning in Large Language Models. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. We find that state-of-the-art LLMs struggle to solve FlawedFictions accurately regardless of the reasoning effort allowed.
arXiv Detail & Related papers (2025-04-16T09:25:54Z) - NeedleBench: Evaluating LLM Retrieval and Reasoning Across Varying Information Densities [51.07379913779232]
NeedleBench is a framework for assessing retrieval and reasoning performance in long-context tasks. It embeds key data points at varying depths to rigorously test model capabilities. Our experiments reveal that reasoning models like DeepSeek-R1 and OpenAI's o3 struggle with continuous retrieval and reasoning in information-dense scenarios.
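The "embeds key data points at varying depths" idea can be illustrated with a minimal sketch. The helper below is not from the NeedleBench codebase; `make_haystack`, `needles`, and `depths` are hypothetical names, and the construction simply inserts each needle sentence at a chosen relative position within filler text:

```python
# Minimal sketch (illustrative, not the NeedleBench implementation):
# embed "needle" facts at chosen relative depths of a long filler context.

def make_haystack(filler: str, needles: list[str], depths: list[float]) -> str:
    """Insert each needle at the given relative depth (0.0 = start, 1.0 = end)."""
    assert len(needles) == len(depths)
    text = filler
    # Insert from deepest to shallowest so earlier offsets stay valid.
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        pos = int(len(text) * depth)
        text = text[:pos] + " " + needle + " " + text[pos:]
    return text

haystack = make_haystack(
    "lorem ipsum " * 1000,
    ["The secret code is 4711.", "The meeting is on Friday."],
    [0.25, 0.75],
)
```

A benchmark built this way would then ask the model to retrieve each embedded fact, varying the number of needles, their depths, and the filler length to control information density.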
arXiv Detail & Related papers (2024-07-16T17:59:06Z) - Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP [32.19010113355365]
We argue that conflating different tasks by their context length is unproductive. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored.
arXiv Detail & Related papers (2024-06-29T11:09:47Z) - One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval than on those requiring global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z) - SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance.
arXiv Detail & Related papers (2024-05-15T21:55:31Z) - JECC: Commonsense Reasoning Tasks Derived from Interactive Fictions [75.42526766746515]
We propose a new commonsense reasoning dataset based on human Interactive Fiction (IF) gameplay walkthroughs.
Our dataset focuses on the assessment of functional commonsense knowledge rules rather than factual knowledge.
Experiments show that the introduced dataset is challenging for prior machine reading models as well as new large language models.
arXiv Detail & Related papers (2022-10-18T19:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.