Art or Artifice? Large Language Models and the False Promise of
Creativity
- URL: http://arxiv.org/abs/2309.14556v3
- Date: Fri, 8 Mar 2024 05:20:08 GMT
- Title: Art or Artifice? Large Language Models and the False Promise of
Creativity
- Authors: Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan,
Chien-Sheng Wu
- Abstract summary: We propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration.
Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals.
- Score: 53.04834589006685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers have argued that large language models (LLMs) exhibit
high-quality writing capabilities, from blogs to stories. However, objectively
evaluating the creativity of a piece of writing is challenging. Inspired by
the Torrance Test of Creative Thinking (TTCT), which measures creativity as a
process, we use the Consensual Assessment Technique [3] and propose the
Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of
Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative
writers and implement a human assessment of 48 stories written either by
professional authors or LLMs using TTCW. Our analysis shows that LLM-generated
stories pass 3-10X fewer TTCW tests than stories written by professionals. In
addition, we explore the use of LLMs as assessors to automate the TTCW
evaluation, revealing that none of the LLMs positively correlate with the
expert assessments.
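To make the aggregation concrete, the sketch below is a hypothetical reconstruction, not the authors' code: each story receives 14 binary TTCW outcomes, stories are compared by pass count, and an LLM assessor's per-story pass counts are correlated against the expert consensus. The function names and toy data are invented for illustration.

```python
# Hypothetical sketch of TTCW-style aggregation (not the authors' code).
# Assumes each story has 14 binary test outcomes, one per TTCW test.
from statistics import mean

N_TESTS = 14  # binary tests across Fluency, Flexibility, Originality, Elaboration

def pass_count(results: list[int]) -> int:
    """Number of the 14 binary TTCW tests a story passes."""
    assert len(results) == N_TESTS
    return sum(results)

def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation between two lists of per-story scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Toy per-story pass counts (0-14): expert consensus vs. an LLM assessor.
expert_passes = [12, 11, 3, 2, 13, 4]
llm_passes = [8, 9, 9, 8, 7, 9]
print(pearson(expert_passes, llm_passes))  # negative -> no positive correlation
```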
Related papers
- Pron vs Prompt: Can Large Language Models already Challenge a World-Class Fiction Author at Creative Text Writing? [0.8999666725996975]
Large Language Models (LLMs) outperform average humans in a wide range of language-related tasks.
We have carried out a contest between Patricio Pron and GPT-4, in the spirit of AI-human duels such as Deep Blue vs Kasparov and AlphaGo vs Lee Sedol.
The results indicate that LLMs are still far from challenging a top human creative writer.
arXiv Detail & Related papers (2024-07-01T09:28:58Z) - Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human-authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z) - LLM Discussion: Enhancing the Creativity of Large Language Models via Discussion Framework and Role-Play [43.55248812883912]
Large language models (LLMs) have shown exceptional proficiency in natural language processing but often fall short of generating creative and original responses to open-ended questions.
We propose LLM Discussion, a three-phase discussion framework that facilitates vigorous and divergent idea exchange and ensures convergence on creative answers.
We evaluate the efficacy of the proposed framework with the Alternative Uses Test, Similarities Test, Instances Test, and Scientific Creativity Test.
arXiv Detail & Related papers (2024-05-10T10:19:14Z) - Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models [8.920436030483872]
We propose Writing Path, a framework that uses explicit outlines to guide Large Language Models (LLMs) in generating user-aligned text.
Our approach draws inspiration from structured writing planning and reasoning paths, focusing on capturing and reflecting user intentions throughout the writing process.
arXiv Detail & Related papers (2024-04-22T06:57:43Z) - Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that LLMs fall short primarily in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z) - Evaluating Large Language Model Creativity from a Literary Perspective [13.672268920902187]
This paper assesses the potential for large language models to serve as assistive tools in the creative writing process.
We develop interactive and multi-voice prompting strategies that interleave background descriptions, instructions that guide composition, samples of text in the target style, and critical discussion of the given samples.
arXiv Detail & Related papers (2023-11-30T16:46:25Z) - Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language
Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the results of LLM evaluation are consistent with those obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Exploring the Use of Large Language Models for Reference-Free Text
Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without a reference.
The Explicit Score, which uses ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three explored approaches (see the sketch after this list).
arXiv Detail & Related papers (2023-04-03T05:29:58Z)
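As a loose illustration of the Explicit Score idea above, the following hypothetical Python sketch (not the paper's actual prompt or code) asks an arbitrary chat model for a numeric quality score and parses the reply. The prompt wording, the ask_llm callable, and the 1-10 scale are all assumptions.

```python
# Hypothetical "Explicit Score" style evaluation (illustrative, not the
# paper's exact prompt): prompt an LLM for a numeric score, then parse it.
import re
from typing import Callable

def explicit_score(text: str, ask_llm: Callable[[str], str],
                   lo: int = 1, hi: int = 10) -> float:
    """Reference-free quality score obtained by prompting an LLM directly."""
    prompt = (
        f"Rate the overall quality of the following text on a scale "
        f"from {lo} to {hi}. Reply with the number only.\n\n{text}"
    )
    reply = ask_llm(prompt)  # any chat-completion client can be wrapped here
    match = re.search(r"\d+(\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No numeric score in reply: {reply!r}")
    return min(max(float(match.group()), lo), hi)  # clamp to the scale
```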