Literary Evidence Retrieval via Long-Context Language Models
- URL: http://arxiv.org/abs/2506.03090v1
- Date: Tue, 03 Jun 2025 17:19:45 GMT
- Title: Literary Evidence Retrieval via Long-Context Language Models
- Authors: Katherine Thai, Mohit Iyyer
- Abstract summary: How well do modern long-context language models understand literary fiction? We build a benchmark where the entire text of a primary source is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination.
- Score: 39.174955595897366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini 2.5 Pro, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open- and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.
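The retrieval-as-generation setup described in the abstract can be sketched as a simple evaluation loop: prompt the model with the full primary source plus the criticism excerpt whose quotation is masked, then score the generated quotation against the gold quotation. The prompt template, fuzzy-match threshold, and `query_llm` stand-in below are illustrative assumptions, not the authors' released evaluation code:

```python
import difflib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so superficial formatting
    differences do not count as retrieval failures."""
    return re.sub(r"\s+", " ", text.strip().lower())

def is_correct(generated: str, gold: str, threshold: float = 0.9) -> bool:
    """Judge a generated quotation against the gold quotation.

    Counts as correct on an exact match after normalization, or when
    the similarity ratio exceeds `threshold` (a hypothetical cutoff;
    the paper's actual matching criterion may differ)."""
    g, t = normalize(generated), normalize(gold)
    if g == t:
        return True
    return difflib.SequenceMatcher(None, g, t).ratio() >= threshold

def build_prompt(primary_source: str, criticism_with_gap: str) -> str:
    """Assemble the long-context prompt: the entire primary source
    plus the criticism excerpt whose quotation has been masked."""
    return (
        "Below is the full text of a novel, followed by a passage of "
        "literary criticism with a quotation replaced by [MASK]. "
        "Reply with the exact missing quotation from the novel.\n\n"
        f"NOVEL:\n{primary_source}\n\nCRITICISM:\n{criticism_with_gap}"
    )

def evaluate(examples, query_llm) -> float:
    """Accuracy over (source, criticism, gold_quote) triples.
    `query_llm` is a stand-in for whatever model API is under test."""
    hits = sum(
        is_correct(query_llm(build_prompt(src, crit)), gold)
        for src, crit, gold in examples
    )
    return hits / len(examples)
```

In practice the scoring step is the contentious part: exact string match penalizes trivial formatting differences, while a loose fuzzy threshold can reward overgeneration, which the abstract notes is a failure mode even for strong models.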
Related papers
- Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language Models [2.7323591332394166]
GLASS (Greimas Literary Analysis via Semiotic Square) is a structured analytical framework based on the Greimas Semiotic Square (GSS). GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.
arXiv Detail & Related papers (2025-06-26T15:10:24Z)
- Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes [9.471374217162843]
We propose Retell, a simple, accessible topic modeling approach for literature. We prompt resource-efficient, generative language models (LMs) to tell what passages show.
arXiv Detail & Related papers (2025-05-29T06:59:21Z)
- Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition [2.048226951354646]
Large language models (LLMs) have emerged as a potential solution for automating the complex processes involved in writing literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs on three key tasks of literature review writing.
arXiv Detail & Related papers (2024-12-18T08:42:25Z)
- A Bayesian Approach to Harnessing the Power of LLMs in Authorship Attribution [57.309390098903]
Authorship attribution aims to identify the origin or author of a document.
Large Language Models (LLMs) with their deep reasoning capabilities and ability to maintain long-range textual associations offer a promising alternative.
Our results on the IMDb and blog datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors.
arXiv Detail & Related papers (2024-10-29T04:14:23Z)
- Says Who? Effective Zero-Shot Annotation of Focalization [0.0]
Focalization, the perspective through which a narrative is presented, is encoded via a wide range of lexico-grammatical features. Even trained annotators frequently disagree on the correct labels, suggesting this task is both qualitatively and computationally challenging. Despite the challenging nature of the task, we find that LLMs show performance comparable to trained human annotators, with GPT-4o achieving an average F1 of 84.79%.
arXiv Detail & Related papers (2024-09-17T17:50:15Z)
- One Thousand and One Pairs: A "novel" challenge for long-context language models [56.60667988954638]
NoCha is a dataset of 1,001 pairs of true and false claims about 67 fictional books.
Our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify.
On average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning.
arXiv Detail & Related papers (2024-06-24T02:03:57Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- Reading Subtext: Evaluating Large Language Models on Short Story Summarization with Writers [25.268709339109893]
We evaluate recent Large Language Models (LLMs) on the challenging task of summarizing short stories.
We work directly with authors to ensure that the stories have not been shared online (and are therefore unseen by the models).
We compare GPT-4, Claude-2.1, and LLama-2-70B and find that all three models make faithfulness mistakes in over 50% of summaries.
arXiv Detail & Related papers (2024-03-02T01:52:14Z)
- Fine-tuning Language Models for Factuality [96.5203774943198]
The capabilities of large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- "Kelly is a Warm Person, Joseph is a Role Model": Gender Biases in LLM-Generated Reference Letters [97.11173801187816]
Large Language Models (LLMs) have recently emerged as an effective tool to assist individuals in writing various types of content.
This paper critically examines gender biases in LLM-generated reference letters.
arXiv Detail & Related papers (2023-10-13T16:12:57Z)
- RELiC: Retrieving Evidence for Literary Claims [29.762552250403544]
We use a large-scale dataset of 78K literary quotations to formulate the novel task of literary evidence retrieval.
We implement a RoBERTa-based dense passage retriever for this task that outperforms existing pretrained information retrieval baselines.
arXiv Detail & Related papers (2022-03-18T16:56:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.