Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
- URL: http://arxiv.org/abs/2504.11900v2
- Date: Fri, 18 Apr 2025 08:44:04 GMT
- Title: Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection
- Authors: Kabir Ahuja, Melanie Sclar, Yulia Tsvetkov
- Abstract summary: Plot hole detection in stories is a proxy to evaluate language understanding and reasoning in Large Language Models. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. We find that state-of-the-art LLMs struggle to accurately solve FlawedFictions regardless of the reasoning effort allowed.
- Score: 35.550137361809405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's world -- requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy to evaluate language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark to evaluate LLMs' plot hole detection abilities in stories that is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle to accurately solve FlawedFictions regardless of the reasoning effort allowed, with performance significantly degrading as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with more than 50% and 100% increases in plot hole detection rates with respect to human-written originals.
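As a concrete illustration of the task the benchmark poses, the sketch below scores a model on binary plot-hole classification. This is a hypothetical harness, not the paper's released code: the `query_llm` stub, the JSONL fields ("story", "label"), and the consistent/flawed label names are all assumptions.

```python
# Hypothetical harness for a plot-hole detection benchmark.
# The JSONL schema ("story", "label") and the query_llm stub are
# assumptions for illustration, not the paper's code or data format.
import json

def query_llm(prompt: str) -> str:
    """Stub for a call to the model under evaluation."""
    raise NotImplementedError("plug in an LLM client here")

def evaluate(dataset_path: str) -> float:
    """Return binary accuracy on stories labeled 'consistent' or 'flawed'."""
    correct, total = 0, 0
    with open(dataset_path) as f:
        for line in f:
            example = json.loads(line)
            prompt = (
                "Does the following story contain a plot hole, i.e., an "
                "inconsistency that breaks the internal logic or rules of "
                "its world? Answer with exactly one word, consistent or "
                "flawed.\n\n" + example["story"]
            )
            answer = query_llm(prompt).strip().lower().strip("'\".")
            correct += int(answer == example["label"])
            total += 1
    return correct / total
```

The paper's actual prompts, metrics, and reasoning-effort controls may differ; this only mirrors the detect-a-flaw framing described in the abstract.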
Related papers
- Learning to Reason for Long-Form Story Generation [98.273323001781]
We propose a general story-generation task (Next-Chapter Prediction) and a reward formulation (Verified Rewards via Completion Likelihood Improvement). We learn to reason over a story's condensed information and generate a detailed plan for the next chapter. Our reasoning is evaluated via the chapters it helps a story-generator create, and compared against non-trained and supervised finetuning (SFT) baselines.
arXiv Detail & Related papers (2025-03-28T18:48:26Z)
MLD-EA: Check and Complete Narrative Coherence by Introducing Emotions and Actions [8.06073345741722]
We introduce the Missing Logic Detector by Emotion and Action (MLD-EA) model. It identifies narrative gaps and generates coherent sentences that integrate seamlessly with the story's emotional and logical flow. This work fills a gap in NLP research and advances broader goals of creating more sophisticated and reliable story-generation systems.
arXiv Detail & Related papers (2024-12-03T23:01:21Z)
Assessing Language Models' Worldview for Fiction Generation [0.0]
This study investigates the ability of Large Language Models to maintain a state of the world essential to generating fiction.
We find that only two models exhibit a consistent worldview, while the rest are self-conflicting.
This uniformity across models further suggests a lack of the 'state' necessary for fiction.
arXiv Detail & Related papers (2024-08-15T03:19:41Z)
Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an over 40% improvement in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z)
Measuring Psychological Depth in Language Models [50.48914935872879]
We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories.
We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha).
Surprisingly, GPT-4 stories either surpassed highly-rated human-written stories sourced from Reddit or were statistically indistinguishable from them.
arXiv Detail & Related papers (2024-06-18T14:51:54Z)
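The 0.72 Krippendorff's alpha above is an inter-rater agreement coefficient. As a rough illustration of how such a score is computed (the ratings below are invented and this is not the paper's analysis code), the `krippendorff` Python package takes a rater-by-item matrix:

```python
# Illustrative Krippendorff's alpha computation; ratings are invented.
# Requires: pip install numpy krippendorff
import numpy as np
import krippendorff

# Rows are human raters, columns are stories; np.nan marks a missing rating.
ratings = np.array([
    [4, 3, 5, 2, np.nan],
    [4, 3, 4, 2, 1],
    [5, 3, 5, np.nan, 1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")  # closer to 1 = stronger agreement
```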
LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
Creating Suspenseful Stories: Iterative Planning with Large Language Models [2.6923151107804055]
We propose a novel iterative-prompting-based planning method that is grounded in two theoretical foundations of story suspense.
To the best of our knowledge, this paper is the first attempt at suspenseful story generation with large language models.
arXiv Detail & Related papers (2024-02-27T01:25:52Z)
Few-Shot Character Understanding in Movies as an Assessment to Meta-Learning of Theory-of-Mind [47.13015852330866]
Humans can quickly understand new fictional characters with a few observations, mainly by drawing analogies to fictional and real people they already know.
This reflects the few-shot and meta-learning essence of humans' inference of characters' mental states, i.e., theory-of-mind (ToM).
We fill this gap with a novel NLP dataset, ToM-in-AMC, the first assessment of machines' meta-learning of ToM in a realistic narrative understanding scenario.
arXiv Detail & Related papers (2022-11-09T05:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.