Related papers: Assessing Language Models' Worldview for Fiction Generation

Assessing Language Models' Worldview for Fiction Generation

URL: http://arxiv.org/abs/2408.07904v1
Date: Thu, 15 Aug 2024 03:19:41 GMT
Title: Assessing Language Models' Worldview for Fiction Generation
Authors: Aisha Khatun, Daniel G. Brown,
Abstract summary: This study investigates the ability of Large Language Models to maintain a state of world essential to generate fiction. We find that only two models exhibit consistent worldview, while the rest are self-conflicting. This uniformity across models further suggests a lack of state' necessary for fiction.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The use of Large Language Models (LLMs) has become ubiquitous, with abundant applications in computational creativity. One such application is fictional story generation. Fiction is a narrative that occurs in a story world that is slightly different than ours. With LLMs becoming writing partners, we question how suitable they are to generate fiction. This study investigates the ability of LLMs to maintain a state of world essential to generate fiction. Through a series of questions to nine LLMs, we find that only two models exhibit consistent worldview, while the rest are self-conflicting. Subsequent analysis of stories generated by four models revealed a strikingly uniform narrative pattern. This uniformity across models further suggests a lack of `state' necessary for fiction. We highlight the limitations of current LLMs in fiction writing and advocate for future research to test and create story worlds for LLMs to reside in. All code, dataset, and the generated responses can be found in https://github.com/tanny411/llm-reliability-and-consistency-evaluation.

Related papers

Can LLMs Generate Good Stories? Insights and Challenges from a Narrative Planning Perspective [5.164206868073554]
We present a benchmark for evaluating Large Language Models (LLMs) on narrative planning based on literature examples.<n>Our experiments show that GPT-4 tier LLMs can generate causally sound stories at small scales, but planning with character intentionality and dramatic conflict remains challenging.
arXiv Detail & Related papers (2025-06-11T20:27:08Z)
Finding Flawed Fictions: Evaluating Complex Reasoning in Language Models via Plot Hole Detection [35.550137361809405]
Plot hole detection in stories is a proxy to evaluate language understanding and reasoning in Large Language Models. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. We find that state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless of the reasoning effort allowed.
arXiv Detail & Related papers (2025-04-16T09:25:54Z)
If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs) Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs [26.682827310724363]
We examine two state-of-the-art large language models (LLMs) on story generation. We find that LLM-generated stories often consist of plot elements that are echoed across a number of generations. We introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines.
arXiv Detail & Related papers (2024-12-31T04:54:48Z)
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z)
The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario [12.852843553759744]
We evaluate recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task. We use a specifically-tailored prompt (based on an epic combat between Ignatius J. Reilly and a pterodactyl) to minimize the risk of training data leakage. evaluation is performed by humans using a detailed rubric including various aspects like fluency, style, originality or humor.
arXiv Detail & Related papers (2024-06-22T17:01:59Z)
LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 literary fictions that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries. We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions. We conduct an in-depth analysis to ascertain how specific attributes of literary fictions (e.g., novel types, character numbers, the year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
Word2World: Generating Stories and Worlds through Large Language Models [5.80330969550483]
Large Language Models (LLMs) have proven their worth across a diverse spectrum of disciplines. This work introduces Word2World, a system that enables LLMs to procedurally design playable games through stories.
arXiv Detail & Related papers (2024-05-06T14:21:52Z)
Creating Suspenseful Stories: Iterative Planning with Large Language Models [2.6923151107804055]
We propose a novel iterative-prompting-based planning method that is grounded in two theoretical foundations of story suspense. To the best of our knowledge, this paper is the first attempt at suspenseful story generation with large language models.
arXiv Detail & Related papers (2024-02-27T01:25:52Z)
Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks. The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human. These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z)
Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
Alignedcot is an in-context learning technique for invoking Large Language Models. It achieves consistent and correct step-wise prompts in zero-shot scenarios. We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing [0.0]
We evaluate recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style. We ask several LLMs and humans to write such a story and conduct a human evalution involving various criteria such as originality, humor, and style. Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions; whereas open-source LLMs lag behind.
arXiv Detail & Related papers (2023-10-12T15:56:24Z)
FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge. We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types. We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.