Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
- URL: http://arxiv.org/abs/2504.12320v1
- Date: Thu, 10 Apr 2025 19:18:56 GMT
- Title: Has the Creativity of Large-Language Models peaked? An analysis of inter- and intra-LLM variability
- Authors: Jennifer Haase, Paul H. P. Hanel, Sebastian Pokutta
- Abstract summary: We evaluated 14 widely used large language models (LLMs) across two validated creativity assessments. Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months. We document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original.
- Score: 23.509294903995745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following the widespread adoption of ChatGPT in early 2023, numerous studies reported that large language models (LLMs) can match or even surpass human performance in creative tasks. However, it remains unclear whether LLMs have become more creative over time, and how consistent their creative output is. In this study, we evaluated 14 widely used LLMs -- including GPT-4, Claude, Llama, Grok, Mistral, and DeepSeek -- across two validated creativity assessments: the Divergent Association Task (DAT) and the Alternative Uses Task (AUT). Contrary to expectations, we found no evidence of increased creative performance over the past 18-24 months, with GPT-4 performing worse than in previous studies. For the more widely used AUT, all models performed on average better than the average human, with GPT-4o and o3-mini performing best. However, only 0.28% of LLM-generated responses reached the top 10% of human creativity benchmarks. Beyond inter-model differences, we document substantial intra-model variability: the same LLM, given the same prompt, can produce outputs ranging from below-average to original. This variability has important implications for both creativity research and practical applications. Ignoring such variability risks misjudging the creative potential of LLMs, either inflating or underestimating their capabilities. The choice of prompts affected LLMs differently. Our findings underscore the need for more nuanced evaluation frameworks and highlight the importance of model selection, prompt design, and repeated assessment when using Generative AI (GenAI) tools in creative contexts.
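As a minimal sketch of the repeated-assessment idea behind these findings, the snippet below scores the same AUT-style prompt many times and reports the spread of creativity ratings rather than just the mean; `generate` and `rate` are hypothetical placeholders for an LLM API call and a validated creativity rater, not the authors' pipeline.

```python
# Minimal sketch (not the paper's evaluation code): repeatedly sample one
# AUT-style prompt from the same model to expose intra-model variability.
from statistics import mean, stdev
from typing import Callable

def assess_variability(prompt: str,
                       generate: Callable[[str], str],   # hypothetical LLM call
                       rate: Callable[[str], float],      # hypothetical creativity rater
                       n_runs: int = 30) -> dict:
    # Same model, same prompt, many samples: report the spread, not just the mean.
    scores = [rate(generate(prompt)) for _ in range(n_runs)]
    return {"mean": mean(scores), "sd": stdev(scores),
            "min": min(scores), "max": max(scores)}

# Usage (with your own model client and rater):
# stats = assess_variability("List as many creative uses for a brick as you can.",
#                            generate, rate)
# A wide min-max range means a single run can look below-average or original,
# which is why single-shot comparisons between models can be misleading.
```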
Related papers
- Modifying Large Language Model Post-Training for Diverse Creative Writing [12.872333448726595]
We investigate post-training approaches that promote both output diversity and quality in creative writing generation.
Our core idea is to include deviation in the training objective to facilitate learning from rare high-quality instances.
Our best 8B-parameter model achieves diversity on par with a human-created dataset while maintaining output quality similar to the best instruction-tuned models.
arXiv Detail & Related papers (2025-03-21T13:21:45Z)
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
Creation-MMBench is a benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. Experimental results reveal that open-source MLLMs significantly underperform compared to proprietary models in creative tasks.
arXiv Detail & Related papers (2025-03-18T17:51:34Z)
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
- LiveIdeaBench: Evaluating LLMs' Scientific Creativity and Idea Generation with Minimal Context [13.967898012303325]
We introduce LiveIdeaBench, a benchmark that evaluates Large Language Models' scientific creativity and divergent thinking capabilities. Our framework employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across four key dimensions: originality, feasibility, fluency, and flexibility.
arXiv Detail & Related papers (2024-12-23T14:13:44Z)
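The snippet below is a hypothetical illustration of the judge-panel idea described above: several LLM judges each rate an idea on the four dimensions, and the panel score is the per-dimension mean. The judges, rating scale, and values are assumptions, not LiveIdeaBench's actual protocol.

```python
# Hypothetical judge-panel aggregation: per-dimension mean over several judges.
from statistics import mean

DIMENSIONS = ("originality", "feasibility", "fluency", "flexibility")

def panel_score(judge_scores: list[dict[str, float]]) -> dict[str, float]:
    # judge_scores: one dict per judge mapping dimension -> rating (e.g., 1-10).
    return {dim: mean(scores[dim] for scores in judge_scores) for dim in DIMENSIONS}

# Illustrative ratings from three judge models for one generated idea.
ratings = [
    {"originality": 7, "feasibility": 5, "fluency": 8, "flexibility": 6},
    {"originality": 6, "feasibility": 6, "fluency": 7, "flexibility": 6},
    {"originality": 8, "feasibility": 4, "fluency": 8, "flexibility": 7},
]
print(panel_score(ratings))
```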
- Do LLMs Agree on the Creativity Evaluation of Alternative Uses? [0.4326762849037007]
This paper investigates whether large language models (LLMs) show agreement in assessing creativity in responses to the Alternative Uses Test (AUT).
Using an oracle benchmark set of AUT responses, we have four state-of-the-art LLMs evaluate these outputs.
Results reveal high inter-model agreement, with Spearman correlations averaging above 0.7 across models and reaching over 0.77 with respect to the oracle.
arXiv Detail & Related papers (2024-11-23T13:34:50Z)
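A minimal sketch of the inter-rater analysis mentioned above: pairwise Spearman correlations between the creativity scores that different LLM judges (and an oracle benchmark) assign to the same AUT responses. The rating values here are illustrative only.

```python
# Sketch of inter-model agreement on AUT creativity ratings via Spearman rho;
# the rating arrays are illustrative, not real data.
from itertools import combinations
from scipy.stats import spearmanr

ratings = {
    "judge_a": [3.0, 4.5, 2.0, 5.0, 1.5],   # hypothetical scores per response
    "judge_b": [2.5, 4.0, 2.5, 4.5, 1.0],
    "oracle":  [3.5, 4.5, 2.0, 5.0, 2.0],   # benchmark (e.g., human) ratings
}

# Pairwise Spearman correlation between every pair of raters.
for a, b in combinations(ratings, 2):
    rho, p = spearmanr(ratings[a], ratings[b])
    print(f"{a} vs {b}: rho={rho:.2f} (p={p:.3f})")
```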
- Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that LLMs fall short primarily in originality while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- Probing the Creativity of Large Language Models: Can models produce divergent semantic association? [9.992602859777689]
The present study investigates the creative thinking of large language models from a cognitive perspective.
We use the Divergent Association Task (DAT), an objective measure of creativity that asks models to generate unrelated words and computes the semantic distance between them.
Our results imply that advanced large language models have divergent semantic associations, which is a fundamental process underlying creativity.
arXiv Detail & Related papers (2023-10-17T11:23:32Z)
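For illustration, a DAT-style score can be sketched as the mean pairwise cosine distance between embeddings of the generated words, commonly rescaled by a factor of 100. The `embed` function below is a stand-in for any word-embedding model (e.g., GloVe or fastText), and the exact protocol used in the paper may differ.

```python
# Sketch of a DAT-style score: mean pairwise cosine distance between word
# embeddings; higher values indicate more divergent (less related) words.
import numpy as np
from typing import Callable

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dat_score(words: list[str], embed: Callable[[str], np.ndarray]) -> float:
    vectors = [embed(w) for w in words]
    # Average distance over all unordered word pairs.
    dists = [cosine_distance(vectors[i], vectors[j])
             for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    # The DAT score is often reported as the mean pairwise distance x 100.
    return 100.0 * sum(dists) / len(dists)
```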
- L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long-context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally cannot correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents [53.78782375511531]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.