Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
- URL: http://arxiv.org/abs/2506.10161v1
- Date: Wed, 11 Jun 2025 20:27:08 GMT
- Title: Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective
- Authors: Yi Wang, Max Kreminski
- Abstract summary: We present a benchmark for evaluating Large Language Models (LLMs) on narrative planning based on literature examples. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging.
- Score: 5.164206868073554
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Story generation has been a prominent application of Large Language Models (LLMs). However, understanding LLMs' ability to produce high-quality stories remains limited due to challenges in automatic evaluation methods and the high cost and subjectivity of manual evaluation. Computational narratology offers valuable insights into what constitutes a good story, which has been applied in the symbolic narrative planning approach to story generation. This work aims to deepen the understanding of LLMs' story generation capabilities by using them to solve narrative planning problems. We present a benchmark for evaluating LLMs on narrative planning based on literature examples, focusing on causal soundness, character intentionality, and dramatic conflict. Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging, requiring LLMs trained with reinforcement learning for complex reasoning. The results offer insights into the scale of stories that LLMs can generate while maintaining quality along these different aspects. Our findings also highlight interesting problem-solving behaviors and shed light on challenges and considerations for applying LLM narrative planning in game environments.
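To make the causal-soundness criterion concrete, below is a minimal sketch assuming a toy planning-style encoding of actions with preconditions and effects (the benchmark's actual problem format and literature examples are not reproduced here): a story is treated as causally sound if every action's preconditions are established by the initial state or by the effects of earlier actions.

```python
# Hypothetical toy encoding of a narrative planning problem; not the
# benchmark's actual problem format or literature examples.
ACTIONS = {
    # action name: (preconditions, effects), over simple string facts
    "travel_to_castle": ({"at(thief, village)"}, {"at(thief, castle)"}),
    "steal_sword": ({"at(thief, castle)"}, {"has(thief, sword)"}),
    "duel_guard": ({"has(thief, sword)", "at(thief, castle)"}, {"defeated(guard)"}),
}
INITIAL_STATE = {"at(thief, village)"}

def causal_soundness(story, actions=ACTIONS, init=INITIAL_STATE):
    """Return (True, None) if every action's preconditions hold when it occurs,
    otherwise (False, first_unsupported_action)."""
    state = set(init)
    for step in story:
        preconditions, effects = actions[step]
        if not preconditions <= state:
            return False, step
        state |= effects  # apply effects (no delete effects in this toy model)
    return True, None

# e.g. a plan an LLM might return, parsed into action names
print(causal_soundness(["travel_to_castle", "steal_sword", "duel_guard"]))  # (True, None)
print(causal_soundness(["duel_guard"]))  # (False, 'duel_guard')
```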
Related papers
- Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection [35.550137361809405]
Plot hole detection in stories is a proxy for evaluating language understanding and reasoning in Large Language Models. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. We find that state-of-the-art LLMs struggle to accurately solve FlawedFictions regardless of the reasoning effort allowed.
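As a rough illustration of controllably injecting a plot hole, here is a hedged sketch assuming a simplified representation of established facts and pre-written contradicting sentences; the actual FlawedFictionsMaker algorithm is more careful and LLM-assisted.

```python
import random

def inject_plot_hole(sentences, facts, contradictions, rng=random.Random(0)):
    """sentences: the story, as a list of sentences.
    facts: {fact_id: index of the sentence establishing the fact}.
    contradictions: {fact_id: a sentence contradicting that fact}.
    Returns (flawed_story, index_where_the_hole_was_inserted)."""
    fact_id = rng.choice(sorted(facts))
    established_at = facts[fact_id]
    # insert the contradicting sentence somewhere after the fact is established
    insert_at = rng.randrange(established_at + 1, len(sentences) + 1)
    flawed = sentences[:insert_at] + [contradictions[fact_id]] + sentences[insert_at:]
    return flawed, insert_at

story = [
    "Mara locked the only key to the vault inside the vault itself.",
    "That night, the town slept uneasily.",
    "At dawn, the mayor demanded the ledger kept inside the vault.",
]
facts = {"key_locked_inside": 0}
contradictions = {"key_locked_inside": "Mara calmly unlocked the vault with her key."}
flawed_story, where = inject_plot_hole(story, facts, contradictions)
print(where)
print("\n".join(flawed_story))
```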
arXiv Detail & Related papers (2025-04-16T09:25:54Z)
- If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact-checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
- How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives [0.0]
A rapidly developing application of LLMs in XAI is to convert quantitative explanations into user-friendly narratives. We propose a framework and explore several automated metrics to evaluate LLM-generated narratives.
arXiv Detail & Related papers (2024-12-13T15:45:45Z)
- Multiverse of Greatness: Generating Story Branches with LLMs [0.6283043694426244]
This paper presents Dynamic Context Prompting/Programming (DCP/P), a novel framework for interacting with LLMs to generate graph-based content with a dynamic context window history.
We evaluate DCP/P against our baseline, which does not provide context history to an LLM and only relies on the initial story data.
We qualitatively examine the quality of the objectively best-performing generated game from each approach.
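The following is a minimal sketch of the dynamic-context idea, assuming a simple node dictionary and a fixed-size window of recently generated branches; the paper's actual prompt format, graph schema, and baseline details are not reproduced.

```python
from collections import deque

# Hypothetical sketch of prompting with a dynamic context window over a
# story graph, in the spirit of DCP/P.
class StoryGraphPrompter:
    def __init__(self, initial_story, window_size=3):
        self.initial_story = initial_story
        self.nodes = {}                          # node_id -> generated text
        self.recent = deque(maxlen=window_size)  # dynamic context window

    def build_prompt(self, parent_id, choice):
        history = "\n".join(self.nodes[i] for i in self.recent)
        return (
            f"Story premise:\n{self.initial_story}\n\n"
            f"Recent branches:\n{history}\n\n"
            f"Continue the branch after node {parent_id} "
            f"where the player chooses: {choice}"
        )

    def add_node(self, node_id, text):
        self.nodes[node_id] = text
        self.recent.append(node_id)

prompter = StoryGraphPrompter("A ranger guards a cursed forest.")
prompter.add_node("n0", "The ranger hears a scream beyond the ridge.")
print(prompter.build_prompt("n0", "investigate the scream"))
# A baseline without context history would omit the 'Recent branches' section.
```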
arXiv Detail & Related papers (2024-11-22T02:11:37Z)
- Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an improvement of over 40% in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z)
- Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation [15.718288693929019]
Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks.
We study whether LLMs can be used as substitutes for human annotators.
We find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations.
arXiv Detail & Related papers (2024-05-22T15:56:52Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 works of literary fiction that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fiction (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvement of Large Language Models. It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop. Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
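Below is a highly simplified sketch of such a search loop, with toy stand-ins (`propose`, `critic`) for the LLM proposer and critic; AlphaLLM's actual imagination, search, and criticizing components are substantially richer.

```python
import math, random

def propose(state, k=3):
    """Stand-in for the LLM proposing k candidate next steps."""
    return [state + [random.random()] for _ in range(k)]

def critic(state):
    """Stand-in for the LLM critic scoring a (partial) solution."""
    return sum(state)

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, iterations=100, depth=4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # selection: descend while every child has been visited at least once
        while node.children and all(ch.visits > 0 for ch in node.children):
            node = max(node.children, key=ucb)
        # expansion: grow the tree until the depth budget is reached
        if len(node.state) < depth and not node.children:
            node.children = [Node(s, node) for s in propose(node.state)]
        if node.children:
            node = next((ch for ch in node.children if ch.visits == 0),
                        node.children[0])
        # evaluation and backpropagation
        reward = critic(node.state)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).state

print(mcts([0.0]))
```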
arXiv Detail & Related papers (2024-04-18T15:21:34Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
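As a toy illustration of language-driven game-theoretic evaluation, the sketch below pits two scripted agents against each other in iterated prisoner's dilemma and tallies payoffs; GTBench's ten tasks, prompting protocol, and scoring are not reproduced here, and an LLM-backed agent would replace the scripted policies.

```python
PAYOFFS = {  # (my_move, their_move) -> my_payoff
    ("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1,
}

def always_defect(history):
    return "D"

def tit_for_tat(history):
    return history[-1][1] if history else "C"   # copy opponent's last move

def play(agent_a, agent_b, rounds=10):
    history_a, history_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a, move_b = agent_a(history_a), agent_b(history_b)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return score_a, score_b

# An LLM-backed agent would map the history into a prompt and parse the reply.
print(play(tit_for_tat, always_defect))   # (9, 14) over 10 rounds
```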
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- UNDIAL: Self-Distillation with Adjusted Logits for Robust Unlearning in Large Language Models [12.45822383965784]
We introduce UnDIAL (Unlearning via Self-Distillation on Adjusted Logits), a novel and robust unlearning method.
Our approach leverages self-distillation to adjust logits and selectively reduce the influence of targeted tokens.
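A rough sketch of the logit-adjustment idea follows, assuming a scalar penalty `gamma` subtracted from the logits of targeted tokens and a KL distillation loss; these hyperparameters and the training-loop details are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_target(teacher_logits, targeted_token_ids, gamma=5.0):
    """Teacher = the model itself (self-distillation); subtract gamma from the
    logits of targeted tokens so the target distribution de-emphasizes them."""
    adjusted = teacher_logits.copy()
    adjusted[..., targeted_token_ids] -= gamma
    return softmax(adjusted)

def kl_div(p, q, eps=1e-12):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

vocab = 10
logits = np.random.randn(4, vocab)             # logits at 4 token positions
target = distillation_target(logits, targeted_token_ids=[3, 7])
loss = kl_div(target, softmax(logits)).mean()  # minimized w.r.t. the student
print(round(float(loss), 4))
```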
arXiv Detail & Related papers (2024-02-15T16:21:14Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations reveal how comprehensively a language model grasps and understands the questions it is asked.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
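The sketch below illustrates the general recipe of perturbing inputs at the word level and comparing reward-model scores, with both the generator and the reward model stubbed out; the paper's actual perturbation types and reward models differ.

```python
import random

def perturb_words(text, rate=0.5, rng=random.Random(0)):
    """Apply simple word-level noise: swap two characters inside some words."""
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

VOCAB = set("answer to explain why the sky is blue in two sentences".split())

def reward_model(question, answer):
    """Stand-in for a pre-trained reward model: here, the fraction of words
    it still recognizes (a toy proxy for answer quality)."""
    words = [w.strip(".:?").lower() for w in answer.split()]
    return sum(w in VOCAB for w in words) / max(len(words), 1)

def robustness_gap(question, generate):
    """Drop in reward when the question is perturbed at the word level."""
    clean_answer = generate(question)
    noisy_answer = generate(perturb_words(question))
    return reward_model(question, clean_answer) - reward_model(question, noisy_answer)

# `generate` would wrap the LLM under test; here a trivial stand-in.
gap = robustness_gap("Explain why the sky is blue in two sentences.",
                     generate=lambda q: f"Answer to: {q}")
print(round(gap, 3))
```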
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.