The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario
- URL: http://arxiv.org/abs/2406.15891v1
- Date: Sat, 22 Jun 2024 17:01:59 GMT
- Title: The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario
- Authors: Carlos Gómez-Rodríguez, Paul Williams
- Abstract summary: We evaluate recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task.
We use a specifically-tailored prompt (based on an epic combat between Ignatius J. Reilly and a pterodactyl) to minimize the risk of training data leakage.
Evaluation is performed by humans using a detailed rubric covering aspects such as fluency, style, originality, and humor.
- Score: 12.852843553759744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This is a summary of the paper "A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing", which was published in Findings of EMNLP 2023. We evaluate a range of recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task, and compare them to human writers. For this purpose, we use a specifically-tailored prompt (based on an epic combat between Ignatius J. Reilly, main character of John Kennedy Toole's "A Confederacy of Dunces", and a pterodactyl) to minimize the risk of training data leakage and force the models to be creative rather than reusing existing stories. The same prompt is presented to LLMs and human writers, and evaluation is performed by humans using a detailed rubric covering aspects such as fluency, style, originality, and humor. Results show that some state-of-the-art commercial LLMs match or slightly outperform our human writers in most of the evaluated dimensions. Open-source LLMs lag behind. Humans keep a close lead in originality, and only the top three LLMs can handle humor at human-like levels.
Related papers
- Evaluating Creative Short Story Generation in Humans and Large Language Models [0.7965327033045846]
Large language models (LLMs) have recently demonstrated the ability to generate high-quality stories.
We conduct a systematic analysis of creativity in short story generation across LLMs and everyday people.
Our findings reveal that while LLMs can generate stylistically complex stories, they tend to fall short in terms of creativity when compared to average human writers.
arXiv Detail & Related papers (2024-11-04T17:40:39Z) - Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an over 40% improvement in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z) - Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing? [0.8999666725996975]
Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks.
We have carried out a contest between Patricio Pron and GPT-4, in the spirit of AI-human duels such as Deep Blue vs. Kasparov and AlphaGo vs. Lee Sedol.
The results indicate that LLMs are still far from challenging a top human creative writer.
arXiv Detail & Related papers (2024-07-01T09:28:58Z) - Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation [15.718288693929019]
Large Language Models (LLMs) achieve state-of-the-art performance on many NLP tasks.
We study whether LLMs can be used as substitutes for human annotators.
We find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations.
arXiv Detail & Related papers (2024-05-22T15:56:52Z) - LFED: A Literary Fiction Evaluation Dataset for Large Language Models [58.85989777743013]
We collect 95 works of literary fiction that are either originally written in Chinese or translated into Chinese, covering a wide range of topics across several centuries.
We define a question taxonomy with 8 question categories to guide the creation of 1,304 questions.
We conduct an in-depth analysis to ascertain how specific attributes of literary fiction (e.g., novel type, number of characters, year of publication) impact LLM performance in evaluations.
arXiv Detail & Related papers (2024-05-16T15:02:24Z) - Character is Destiny: Can Large Language Models Simulate Persona-Driven Decisions in Role-Playing? [59.0123596591807]
We benchmark the ability of Large Language Models in persona-driven decision-making.
We investigate whether LLMs can predict characters' decisions provided with the preceding stories in high-quality novels.
The results demonstrate that state-of-the-art LLMs exhibit promising capabilities in this task, yet there is substantial room for improvement.
arXiv Detail & Related papers (2024-04-18T12:40:59Z) - A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing [0.0]
We evaluate recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style.
We ask several LLMs and humans to write such a story and conduct a human evaluation involving various criteria such as originality, humor, and style.
Our results show that some state-of-the-art commercial LLMs match or slightly outperform our human writers in most dimensions, whereas open-source LLMs lag behind.
arXiv Detail & Related papers (2023-10-12T15:56:24Z) - In-Context Impersonation Reveals Large Language Models' Strengths and Biases [56.61129643802483]
We ask LLMs to assume different personas before solving vision and language tasks.
We find that LLMs pretending to be children of different ages recover human-like developmental stages.
In a language-based reasoning task, we find that LLMs impersonating domain experts perform better than LLMs impersonating non-domain experts.
arXiv Detail & Related papers (2023-05-24T09:13:15Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Benchmarking Large Language Models for News Summarization [79.37850439866938]
Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood.
We find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability.
arXiv Detail & Related papers (2023-01-31T18:46:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.