A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative
Writing
- URL: http://arxiv.org/abs/2310.08433v1
- Date: Thu, 12 Oct 2023 15:56:24 GMT
- Title: A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative
Writing
- Authors: Carlos Gómez-Rodríguez and Paul Williams
- Abstract summary: We evaluate recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style.
We ask several LLMs and humans to write such a story and conduct a human evaluation involving various criteria such as originality, humor, and style.
Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions, whereas open-source LLMs lag behind.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We evaluate a range of recent LLMs on English creative writing, a challenging
and complex task that requires imagination, coherence, and style. We use a
difficult, open-ended scenario chosen to avoid training data reuse: an epic
narration of a single combat between Ignatius J. Reilly, the protagonist of the
Pulitzer Prize-winning novel A Confederacy of Dunces (1980), and a pterodactyl,
a prehistoric flying reptile. We ask several LLMs and humans to write such a
story and conduct a human evaluation involving various criteria such as fluency,
coherence, originality, humor, and style. Our results show that some
state-of-the-art commercial LLMs match or slightly outperform our writers in
most dimensions, whereas open-source LLMs lag behind. Humans retain an edge in
creativity, while humor shows a binary divide between LLMs that can handle it
comparably to humans and those that fail at it. We discuss the implications and
limitations of our study and suggest directions for future research.
Related papers
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking.
LoTbench is an interactive, causality-aware evaluation framework.
Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
- Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs [26.682827310724363]
We examine two state-of-the-art large language models (LLMs) on story generation.
We find that LLM-generated stories often consist of plot elements that are echoed across a number of generations.
We introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines (a rough sketch follows this entry).
arXiv Detail & Related papers (2024-12-31T04:54:48Z)
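The abstract does not spell out how the Sui Generis score is computed, so the following is only a rough sketch of the idea: score a plot element by how rarely anything similar appears across a set of alternative storylines. The Jaccard word-overlap similarity, the 0.5 threshold, and the smoothed negative-log form are all assumptions standing in for whatever matching the paper actually uses.

```python
import math

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a crude stand-in for semantic matching."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def sui_generis_like(element: str, alternatives: list[list[str]],
                     threshold: float = 0.5) -> float:
    """Higher score = the element is rarer across alternative storylines.

    Counts the fraction of alternative storylines containing a similar
    element, then returns its negative log (Laplace-smoothed). The exact
    formulation in the paper may differ.
    """
    hits = sum(
        any(jaccard(element, other) >= threshold for other in storyline)
        for storyline in alternatives
    )
    p = (hits + 1) / (len(alternatives) + 1)  # smoothed hit probability
    return -math.log(p)

# Toy usage: one element checked against three alternative storylines.
alts = [
    ["the hero fights a dragon", "the hero wins"],
    ["the hero fights a dragon", "the dragon flees"],
    ["the hero opens a bakery", "the bakery burns down"],
]
print(sui_generis_like("the hero fights a dragon", alts))  # common -> low
print(sui_generis_like("the hero opens a bakery", alts))   # rare -> higher
```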
- Large Language Models show both individual and collective creativity comparable to humans [39.90254321453145]
Large Language Models (LLMs) show creativity comparable to humans.
We benchmark the LLMs against individual humans, and also take a novel approach by comparing them to the collective creativity of groups of humans.
When questioned 10 times, an LLM's collective creativity is equivalent to that of 8-10 humans (a minimal pooling sketch follows this entry).
arXiv Detail & Related papers (2024-12-04T09:18:54Z)
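The abstract leaves the pooling procedure implicit; as a minimal sketch of one plausible operationalization, the snippet below pools ideas from repeated queries and counts distinct ones, so a single LLM sampled 10 times can be compared with a group of humans queried once each. Deduplication by normalized exact match is an assumption; the study presumably merges near-duplicates with a semantic criterion or human raters.

```python
def distinct_ideas(responses: list[str]) -> set[str]:
    """Pool responses and deduplicate them (naive normalized exact match)."""
    return {r.strip().lower() for r in responses}

# A single LLM queried 10 times...
llm_runs = [
    "a hat that eats rain", "shoes for clouds", "a hat that eats rain",
    "a ladder to yesterday", "shoes for clouds", "a mirror that forgets",
    "a hat that eats rain", "a ladder to yesterday", "shoes for clouds",
    "a door to last tuesday",
]
# ...versus a group of humans, each queried once.
human_group = [
    "a hat that eats rain", "a singing doorknob", "a ladder to yesterday",
]

print(len(distinct_ideas(llm_runs)))     # LLM's collective idea count: 5
print(len(distinct_ideas(human_group)))  # group's collective idea count: 3
```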
- Assessing Language Models' Worldview for Fiction Generation [0.0]
This study investigates the ability of Large Language Models to maintain a state of the world essential to generating fiction.
We find that only two models exhibit a consistent worldview, while the rest are self-conflicting.
This uniformity across models further suggests a lack of 'state' necessary for fiction.
arXiv Detail & Related papers (2024-08-15T03:19:41Z)
- Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an over-40% improvement in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z)
- Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing? [0.8999666725996975]
Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks.
We have carried out a contest between Patricio Pron and GPT-4, in the spirit of AI-human duels such as Deep Blue vs. Kasparov and AlphaGo vs. Lee Sedol.
The results indicate that LLMs are still far from challenging a top human creative writer.
arXiv Detail & Related papers (2024-07-01T09:28:58Z)
- The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario [12.852843553759744]
We evaluate recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task.
We use a specifically tailored prompt (based on an epic combat between Ignatius J. Reilly and a pterodactyl) to minimize the risk of training data leakage.
Evaluation is performed by humans using a detailed rubric covering aspects such as fluency, style, originality, and humor.
arXiv Detail & Related papers (2024-06-22T17:01:59Z)
- HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing [45.95600225239927]
Large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literary writing.
We present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting.
arXiv Detail & Related papers (2024-06-17T16:01:33Z)
- LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 works of literary fiction, either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of the fiction works (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- In-Context Impersonation Reveals Large Language Models' Strengths and Biases [56.61129643802483]
We ask LLMs to assume different personas before solving vision and language tasks.
We find that LLMs pretending to be children of different ages recover human-like developmental stages.
In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts (a minimal prompt sketch follows this entry).
arXiv Detail & Related papers (2023-05-24T09:13:15Z)
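A minimal sketch of the impersonation setup, assuming a simple prompt template of the form "If you were X ..."; the paper's exact templates and personas are not reproduced here.

```python
def persona_prompt(persona: str, task: str) -> str:
    """Prefix a task with an in-context impersonation instruction."""
    return f"If you were {persona}, how would you answer this?\n{task}"

# Contrast personas on the same reasoning question: expert vs. non-expert.
question = "Why does ice float on water?"
for persona in ("a 4-year-old child", "a physicist", "a historian"):
    print(persona_prompt(persona, question))
    print()
```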