Art or Artifice? Large Language Models and the False Promise of
Creativity
- URL: http://arxiv.org/abs/2309.14556v3
- Date: Fri, 8 Mar 2024 05:20:08 GMT
- Title: Art or Artifice? Large Language Models and the False Promise of
Creativity
- Authors: Tuhin Chakrabarty, Philippe Laban, Divyansh Agarwal, Smaranda Muresan,
Chien-Sheng Wu
- Abstract summary: We propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration.
Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals.
- Score: 53.04834589006685
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers have argued that large language models (LLMs) exhibit
high-quality writing capabilities from blogs to stories. However, objectively
evaluating the creativity of a piece of writing is challenging. Inspired by
the Torrance Test of Creative Thinking (TTCT), which measures creativity as a
process, we use the Consensual Assessment Technique [3] and propose the
Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of
Fluency, Flexibility, Originality, and Elaboration. We recruit 10 creative
writers and implement a human assessment of 48 stories written either by
professional authors or LLMs using TTCW. Our analysis shows that LLM-generated
stories pass 3-10X fewer TTCW tests than stories written by professionals. In
addition, we explore the use of LLMs as assessors to automate the TTCW
evaluation, revealing that none of the LLMs positively correlate with the
expert assessments.
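As a rough illustration of how TTCW scoring aggregates, the sketch below tallies binary pass/fail judgments into the four dimensions. Only the total of 14 binary tests and the four dimension names come from the abstract; the individual test names and the per-dimension split are hypothetical placeholders, not the paper's exact test battery.
```python
from __future__ import annotations

# Hypothetical grouping of the 14 binary TTCW tests into the four named
# dimensions; test names and per-dimension split are placeholders for
# illustration, not the paper's exact battery.
TTCW_TESTS = {
    "Fluency": ["understandability", "language_proficiency", "narrative_ending",
                "narrative_pacing", "scene_vs_exposition"],
    "Flexibility": ["perspective_flexibility", "emotional_flexibility",
                    "structural_flexibility"],
    "Originality": ["originality_in_form", "originality_in_theme", "avoids_cliche"],
    "Elaboration": ["world_building", "character_development", "rhetorical_complexity"],
}

def score_ttcw(outcomes: dict[str, bool]) -> dict[str, str]:
    """Tally binary pass/fail outcomes per dimension and overall (out of 14)."""
    report, total = {}, 0
    for dimension, tests in TTCW_TESTS.items():
        passed = sum(bool(outcomes.get(t, False)) for t in tests)
        total += passed
        report[dimension] = f"{passed}/{len(tests)}"
    report["Total"] = f"{total}/14"
    return report

# Example: one expert's judgments on a single story.
judgments = {t: True for tests in TTCW_TESTS.values() for t in tests}
judgments["avoids_cliche"] = False
print(score_ttcw(judgments))  # {'Fluency': '5/5', ..., 'Total': '13/14'}
```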
Related papers
- Do LLMs Agree on the Creativity Evaluation of Alternative Uses? [0.4326762849037007]
This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT).
Using an oracle benchmark set of AUT responses, we experiment with four state-of-the-art LLMs evaluating these outputs.
Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle.
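The pairwise agreement reported above is a standard rank-correlation computation; a minimal sketch follows, assuming SciPy and using placeholder model names and ratings rather than data from the paper.
```python
from itertools import combinations
from scipy.stats import spearmanr

# Placeholder creativity ratings (1-5) that hypothetical LLM judges and an
# oracle assigned to the same six AUT responses; not data from the paper.
ratings = {
    "model_a": [4, 2, 5, 3, 1, 4],
    "model_b": [5, 2, 4, 3, 1, 5],
    "model_c": [4, 1, 5, 2, 2, 4],
    "oracle":  [4, 2, 5, 3, 1, 5],
}

# Pairwise Spearman rank correlation as the inter-model agreement measure.
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    rho, p_value = spearmanr(a, b)
    print(f"{name_a} vs {name_b}: rho={rho:.2f} (p={p_value:.3f})")
```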
arXiv Detail & Related papers (2024-11-23T13:34:50Z)
- Evaluating Creative Short Story Generation in Humans and Large Language Models [0.7965327033045846]
Large language models (LLMs) have recently demonstrated the ability to generate high-quality stories.
We conduct a systematic analysis of creativity in short story generation across LLMs and everyday people.
Our findings reveal that while LLMs can generate stylistically complex stories, they tend to fall short in terms of creativity when compared to average human writers.
arXiv Detail & Related papers (2024-11-04T17:40:39Z)
- AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text [53.15652021126663]
We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text.
To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm.
Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs.
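A greatly simplified stand-in for the span-matching idea behind CREATIVITY INDEX is sketched below: it measures what fraction of a candidate text's tokens falls inside token spans that also occur verbatim in a reference corpus. The actual DJ SEARCH algorithm, its near-verbatim matching, and the web-scale corpus are not reproduced here; the span-length threshold and the example strings are illustrative assumptions.
```python
def covered_fraction(text: str, corpus: str, min_len: int = 5) -> float:
    """Fraction of tokens in `text` covered by spans of >= min_len tokens
    that also occur verbatim in `corpus` (token-boundary matching)."""
    tokens = text.lower().split()
    haystack = " " + " ".join(corpus.lower().split()) + " "
    covered = [False] * len(tokens)
    for i in range(len(tokens)):
        j = i
        # Greedily extend the longest span starting at token i that occurs in the corpus.
        while j < len(tokens) and " " + " ".join(tokens[i:j + 1]) + " " in haystack:
            j += 1
        if j - i >= min_len:
            covered[i:j] = [True] * (j - i)
    return sum(covered) / max(len(tokens), 1)

# Toy example: the higher the matched fraction, the less "creative" the text
# under this (simplified) notion of reconstructability from existing text.
reference = "the quick brown fox jumps over the lazy dog every single morning"
candidate = "every single morning the quick brown fox jumps over a sleeping cat"
print(f"matched fraction: {covered_fraction(candidate, reference):.2f}")  # 0.50
```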
arXiv Detail & Related papers (2024-10-05T18:55:01Z)
- The creative psychometric item generator: a framework for item generation and validation using large language models [1.765099515298011]
Large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity.
We develop a psychometrically inspired framework for creating test items for a classic free-response creativity test: the creative problem-solving (CPS) task.
We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process.
arXiv Detail & Related papers (2024-08-30T18:31:02Z)
- Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts [49.97673761305336]
We evaluate three large language models (LLMs) for their alignment with human narrative styles and potential gender biases.
Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases.
arXiv Detail & Related papers (2024-06-27T19:26:11Z)
- Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that the creativity of LLMs primarily falls short in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
- Evaluating Large Language Model Creativity from a Literary Perspective [13.672268920902187]
This paper assesses the potential for large language models to serve as assistive tools in the creative writing process.
We develop interactive and multi-voice prompting strategies that interleave background descriptions, instructions that guide composition, samples of text in the target style, and critical discussion of the given samples.
arXiv Detail & Related papers (2023-11-30T16:46:25Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)
- Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study [63.27346930921658]
ChatGPT is capable of evaluating text quality effectively from various perspectives without reference.
The Explicit Score, which utilizes ChatGPT to generate a numeric score measuring text quality, is the most effective and reliable of the three approaches explored.
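A minimal sketch of the Explicit Score idea, assuming access to some chat-completion API: the prompt wording, the `ask_llm` stub, and the 1-10 scale are illustrative assumptions rather than the paper's exact protocol.
```python
import re

PROMPT_TEMPLATE = (
    "Evaluate the quality of the following text on a scale from 1 to 10, "
    "considering coherence, fluency, and overall writing quality. "
    "Reply with the numeric score only.\n\nText:\n{text}"
)

def ask_llm(prompt: str) -> str:
    """Placeholder: wire this up to whichever chat-completion client you use."""
    raise NotImplementedError

def explicit_score(text: str, ask=ask_llm):
    """Request a numeric quality score and parse the first number in the reply."""
    reply = ask(PROMPT_TEMPLATE.format(text=text))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None

# Demo with a canned reply standing in for a real model call:
print(explicit_score("Once upon a midnight dreary...", ask=lambda p: "Score: 8"))  # 8.0
```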
arXiv Detail & Related papers (2023-04-03T05:29:58Z)