Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
- URL: http://arxiv.org/abs/2508.05470v1
- Date: Thu, 07 Aug 2025 15:11:48 GMT
- Title: Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations
- Authors: Li-Chun Lu, Miri Liu, Pin-Chun Lu, Yufei Tian, Shao-Hua Sun, Nanyun Peng
- Abstract summary: We examine creativity measures including the creativity index, perplexity, syntactic templates, and LLM-as-a-Judge. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We systematically examine, analyze, and compare representative creativity measures--creativity index, perplexity, syntactic templates, and LLM-as-a-Judge--across diverse creative domains, including creative writing, unconventional problem-solving, and research ideation. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity. We highlight key limitations, including the creativity index's focus on lexical diversity, perplexity's sensitivity to model confidence, and syntactic templates' inability to capture conceptual creativity. Additionally, LLM-as-a-Judge shows instability and bias. Our findings underscore the need for more robust, generalizable evaluation frameworks that better align with human judgments of creativity.
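Of the metrics above, perplexity is the simplest to reproduce: score a text by the exponentiated mean negative log-likelihood per token under a reference language model. A minimal sketch, assuming a Hugging Face causal LM as the scorer (the model choice is illustrative, not specified by the paper):

```python
# Minimal perplexity sketch: exp(mean negative log-likelihood per token)
# under a causal LM. Lower perplexity = more predictable text, which is why
# the metric tracks the scoring model's confidence rather than creativity
# per se. The model choice here is illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(text: str, model_name: str = "gpt2") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy loss over shifted next-token predictions.
        loss = model(input_ids, labels=input_ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Because the score depends entirely on the scoring model's next-token confidence, the same text can rank very differently under different scorers, which is the sensitivity the abstract flags.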
Related papers
- Creativity in LLM-based Multi-Agent Systems: A Survey [56.25583236738877]
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.
arXiv Detail & Related papers (2025-05-27T12:36:14Z)
- Creative Preference Optimization [13.266256662558822]
We propose a novel alignment method that injects signals from multiple creativity dimensions into the preference optimization objective in a modular fashion. We train and evaluate creativity-augmented versions of several models using CrPO and MuCE, a new large-scale human preference dataset. Our models outperform strong baselines, including GPT-4o, on both automated and human evaluations.
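A toy sketch of what modular injection of creativity dimensions could look like: per-dimension scores are combined into one scalar signal that a preference objective can consume. Dimension names and weights here are illustrative assumptions, not CrPO's actual design:

```python
# Toy sketch: combine per-dimension creativity scores into one scalar that a
# preference-optimization objective (e.g., a DPO-style loss) could consume.
# Dimensions and weights are illustrative assumptions, not from the paper.
from typing import Dict, Optional

def creativity_signal(scores: Dict[str, float],
                      weights: Optional[Dict[str, float]] = None) -> float:
    weights = weights or {"novelty": 0.4, "surprise": 0.3, "value": 0.3}
    # Modular: a dimension is added or dropped by editing the weights only.
    return sum(w * scores.get(dim, 0.0) for dim, w in weights.items())

chosen   = {"novelty": 0.9, "surprise": 0.7, "value": 0.6}
rejected = {"novelty": 0.3, "surprise": 0.2, "value": 0.8}
# Keep a preference pair only if the combined signal actually separates it.
print(creativity_signal(chosen) > creativity_signal(rejected))  # True
```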
arXiv Detail & Related papers (2025-05-20T14:43:41Z)
- Cooking Up Creativity: A Cognitively-Inspired Approach for Enhancing LLM Creativity through Structured Representations [53.950760059792614]
Large Language Models (LLMs) excel at countless tasks, yet struggle with creativity. We introduce a novel approach that couples LLMs with structured representations and cognitively inspired manipulations to generate more creative and diverse ideas. We demonstrate our approach in the culinary domain with DishCOVER, a model that generates creative recipes.
arXiv Detail & Related papers (2025-04-29T11:13:06Z)
- Probing and Inducing Combinational Creativity in Vision-Language Models [52.76981145923602]
Recent advances in Vision-Language Models (VLMs) have sparked debate about whether their outputs reflect combinational creativity. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework.
arXiv Detail & Related papers (2025-04-17T17:38:18Z)
- Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLM [58.42678619252968]
Creation-MMBench is a benchmark designed to evaluate the creative capabilities of Multimodal Large Language Models. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. Experimental results reveal that open-source MLLMs significantly underperform compared to proprietary models in creative tasks.
arXiv Detail & Related papers (2025-03-18T17:51:34Z)
- Leveraging Large Models for Evaluating Novel Content: A Case Study on Advertisement Creativity [26.90276644134837]
We attempt to break down visual advertisement creativity into atypicality and originality. With fine-grained human annotations, we propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark.
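Alignment between model and human ratings on a subjective task like this is commonly reported as rank correlation; a minimal sketch with SciPy, using placeholder scores rather than the paper's data:

```python
# Sketch of the kind of VLM-human alignment check the entry describes:
# Spearman rank correlation between model and human creativity ratings.
# The scores below are placeholder data, not from the paper.
from scipy.stats import spearmanr

human_scores = [4.5, 2.0, 3.5, 5.0, 1.5, 3.0]  # e.g., mean annotator ratings
vlm_scores   = [4.0, 2.5, 3.0, 4.5, 2.0, 3.5]  # same ads scored by a VLM

rho, p_value = spearmanr(human_scores, vlm_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```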
arXiv Detail & Related papers (2025-02-26T04:28:03Z)
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
- LiveIdeaBench: Evaluating LLMs' Divergent Thinking for Scientific Idea Generation with Minimal Context [13.967898012303325]
We introduce LiveIdeaBench, a benchmark evaluating Large Language Models' scientific idea generation. Our benchmark employs a dynamic panel of state-of-the-art LLMs to assess generated ideas across five key dimensions: originality, feasibility, fluency, flexibility, and clarity. Our results demonstrate that models like QwQ-32B-preview achieve creative performance comparable to top-tier models such as claude-3.7-sonnet:thinking, despite significant gaps in their general intelligence scores.
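A hedged sketch of the panel aggregation this setup implies: each judge scores an idea on the five dimensions, and per-dimension scores are averaged across the panel. The plain-mean rule is an assumption for illustration, not the benchmark's published protocol:

```python
# Sketch of aggregating a judge panel's scores over LiveIdeaBench's five
# dimensions. The aggregation rule (plain mean) is an assumption for
# illustration; the benchmark's actual protocol may differ.
from statistics import mean

DIMENSIONS = ["originality", "feasibility", "fluency", "flexibility", "clarity"]

def aggregate(panel_scores: list[dict[str, float]]) -> dict[str, float]:
    # panel_scores: one dict of dimension -> score per judge model.
    return {d: mean(judge[d] for judge in panel_scores) for d in DIMENSIONS}

panel = [
    {"originality": 8, "feasibility": 6, "fluency": 7, "flexibility": 7, "clarity": 9},
    {"originality": 7, "feasibility": 7, "fluency": 8, "flexibility": 6, "clarity": 8},
]
print(aggregate(panel))
```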
arXiv Detail & Related papers (2024-12-23T14:13:44Z)
- Benchmarking Language Model Creativity: A Case Study on Code Generation [39.546827184857754]
In this work, we introduce a framework for quantifying LLM creativity. We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the generated creative responses. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions.
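A rough sketch of how the two components might be operationalized on a set of generated solutions: convergent thinking as the fraction that pass the tests, divergent thinking as the count of distinct correct strategies. Both operationalizations are assumptions for illustration; the paper defines NEOGAUGE precisely:

```python
# Rough NEOGAUGE-style sketch over a set of generated code solutions.
# Convergent = fraction passing tests; divergent = number of distinct
# strategies among passing solutions. Both are illustrative assumptions.

def neogauge_like(solutions: list[dict]) -> dict[str, float]:
    # Each solution: {"passes_tests": bool, "strategy": str}
    if not solutions:
        return {"convergent": 0.0, "divergent": 0.0}
    convergent = sum(s["passes_tests"] for s in solutions) / len(solutions)
    divergent = len({s["strategy"] for s in solutions if s["passes_tests"]})
    return {"convergent": convergent, "divergent": float(divergent)}

subs = [
    {"passes_tests": True,  "strategy": "dynamic_programming"},
    {"passes_tests": True,  "strategy": "greedy"},
    {"passes_tests": False, "strategy": "brute_force"},
]
print(neogauge_like(subs))  # convergent ~0.67, divergent = 2
```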
arXiv Detail & Related papers (2024-07-12T05:55:22Z)
- Assessing and Understanding Creativity in Large Language Models [33.37237667182931]
This paper aims to establish an efficient framework for assessing the level of creativity in large language models (LLMs).
By adapting the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks.
We found that LLMs fall short primarily in originality, while excelling in elaboration.
arXiv Detail & Related papers (2024-01-23T05:19:47Z)
- Can AI Be as Creative as Humans? [84.43873277557852]
We prove in theory that AI can be as creative as humans under the condition that it can properly fit the data generated by human creators.
The debate on AI's creativity thus reduces to the question of its ability to fit a sufficient amount of data.
arXiv Detail & Related papers (2024-01-03T08:49:12Z)