KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
- URL: http://arxiv.org/abs/2505.09825v2
- Date: Tue, 03 Jun 2025 15:11:26 GMT
- Title: KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning
- Authors: Peiqi Sui, Juan Diego Rodriguez, Philippe Laban, Dean Murphy, Joseph P. Dexter, Richard Jean So, Samuel Baker, Pramit Chaudhuri
- Abstract summary: KRISTEVA is the first close reading benchmark for evaluating interpretive reasoning. It consists of 1331 multiple-choice questions adapted from classroom data. Our results find that while state-of-the-art LLMs possess some college-level close reading competency, their performances still trail those of experienced human evaluators on 10 out of 11 tasks.
- Score: 9.927958243208952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% - 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.
Related papers
- MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark to assess Large Language Models (LLMs). MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, manually translated by native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z) - The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing [1.3654846342364306]
We use five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers) to extract 17 reference-less textual features. We model individual reader preferences, deriving feature importance vectors that reflect their textual priorities. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences.
arXiv Detail & Related papers (2025-06-03T18:50:22Z) - Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs [50.0874045899661]
We introduce CharacterBot, a model designed to replicate both the linguistic patterns and distinctive thought processes of a character. Using Lu Xun as a case study, we propose four training tasks derived from his 17 essay collections. These include a pre-training task focused on mastering external linguistic structures and knowledge, as well as three fine-tuning tasks. We evaluate CharacterBot on three tasks for linguistic accuracy and opinion comprehension, demonstrating that it significantly outperforms the baselines on our adapted metrics.
arXiv Detail & Related papers (2025-02-18T16:11:54Z) - Can AI grade your essays? A comparative analysis of large language models and teacher ratings in multidimensional essay scoring [8.71931996488953]
Recent developments in generative AI, such as large language models, offer potential solutions to facilitate essay-scoring tasks for teachers.
We evaluate the performance and reliability of both open-source and closed-source LLMs in assessing German student essays.
arXiv Detail & Related papers (2024-11-25T12:33:14Z) - DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels [86.93099925711388]
We propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1200 human-annotated questions in both Chinese and English.
arXiv Detail & Related papers (2024-09-04T06:28:22Z) - LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z) - NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens [63.7488938083696]
We introduce NovelQA, a benchmark tailored for evaluating Large Language Models (LLMs) on complex, extended narratives. NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding. Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses.
arXiv Detail & Related papers (2024-03-18T17:32:32Z) - Can Large Language Models Identify Authorship? [16.35265384114857]
Large Language Models (LLMs) have demonstrated an exceptional capacity for reasoning and problem-solving.
This work seeks to address three research questions: (1) Can LLMs perform zero-shot, end-to-end authorship verification effectively?
(2) Are LLMs capable of accurately attributing authorship among multiple candidate authors (e.g., 10 and 20)?
arXiv Detail & Related papers (2024-03-13T03:22:02Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z) - Natural Language Inference in Context -- Investigating Contextual Reasoning over Long Texts [19.894104911338353]
ConTRoL is a new dataset for ConTextual Reasoning over Long texts.
It consists of 8,325 expert-designed "context-hypothesis" pairs with gold labels.
It is derived from competitive selection and recruitment tests (verbal reasoning tests) used in police recruitment, with expert-level quality.
arXiv Detail & Related papers (2020-11-10T02:31:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.