Evaluation Framework for AI Creativity: A Case Study Based on Story Generation
- URL: http://arxiv.org/abs/2601.03698v1
- Date: Wed, 07 Jan 2026 08:31:08 GMT
- Title: Evaluation Framework for AI Creativity: A Case Study Based on Story Generation
- Authors: Pharath Sathya, Yin Jou Huang, Fei Cheng
- Abstract summary: Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via ``Spike Prompting'' and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments.
- Score: 5.536493649574258
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating creative text generation remains a challenge because existing reference-based metrics fail to capture the subjective nature of creativity. We propose a structured evaluation framework for AI story generation comprising four components (Novelty, Value, Adherence, and Resonance) and eleven sub-components. Using controlled story generation via ``Spike Prompting'' and a crowdsourced study of 115 readers, we examine how different creative components shape both immediate and reflective human creativity judgments. Our findings show that creativity is evaluated hierarchically rather than cumulatively, with different dimensions becoming salient at different stages of judgment, and that reflective evaluation substantially alters both ratings and inter-rater agreement. Together, these results support the effectiveness of our framework in revealing dimensions of creativity that are obscured by reference-based evaluation.
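The abstract's claim that reflective evaluation alters both ratings and inter-rater agreement suggests a simple way to probe the framework: collect per-component ratings at both stages and compare an agreement statistic. The sketch below is a minimal illustration, assuming a 1-5 rating scale, toy data, and Fleiss' kappa as the agreement statistic; only the four component names come from the paper.

```python
# Minimal sketch of how the framework's ratings might be organized and
# compared. The four component names come from the abstract; the 1-5
# scale, the toy data, and Fleiss' kappa are assumptions for illustration.
from collections import Counter

COMPONENTS = ["Novelty", "Value", "Adherence", "Resonance"]

def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Inter-rater agreement; ratings[i] holds every rater's score for
    story i. 1.0 means perfect agreement, ~0.0 means chance level."""
    n_items, n_raters = len(ratings), len(ratings[0])
    cat_totals = Counter()
    p_items = []
    for item in ratings:
        counts = Counter(item)
        cat_totals.update(counts)
        # Proportion of rater pairs that agree on this story.
        agree = sum(c * (c - 1) for c in counts.values())
        p_items.append(agree / (n_raters * (n_raters - 1)))
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category distribution.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in cat_totals.values())
    return (p_bar - p_e) / (1 - p_e)

# Toy example: immediate vs. reflective Novelty ratings,
# 3 stories x 4 raters each (a real study would cover all COMPONENTS).
immediate = [[5, 3, 4, 2], [2, 2, 3, 5], [4, 1, 5, 3]]
reflective = [[4, 4, 4, 3], [2, 2, 2, 3], [4, 4, 5, 4]]
for stage, data in [("immediate", immediate), ("reflective", reflective)]:
    print(f"Novelty kappa, {stage}: {fleiss_kappa(data):+.2f}")
```

On the toy data, agreement rises from below chance in the immediate pass to a modest positive kappa after reflection, the direction of change the abstract reports; real values would of course depend on the study's actual scales and data.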
Related papers
- InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem [87.30601926271864]
InnoEval is a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval.
arXiv Detail & Related papers (2026-02-16T00:40:31Z)
- Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings [18.09092203643732]
We propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions.
arXiv Detail & Related papers (2025-11-17T02:16:01Z)
- CreativityPrism: A Holistic Benchmark for Large Language Model Creativity [64.18257552903151]
Creativity is often seen as a hallmark of human intelligence. There is still no holistic framework to evaluate the creativity of large language models across diverse scenarios. We propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity.
arXiv Detail & Related papers (2025-10-23T00:22:10Z)
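As a rough illustration of how a diversity dimension like the one CreativityPrism names might be operationalized, the sketch below computes distinct-n scores over a set of generations. Distinct-n is a common proxy, not necessarily the benchmark's actual metric; the function name, tokenization, and example stories are assumptions.

```python
# Hedged sketch: distinct-n as one possible proxy for a "diversity"
# dimension (not necessarily what CreativityPrism itself uses).
def distinct_n(texts: list[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all generations."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

stories = [
    "The clock in the lighthouse ran backwards every full moon.",
    "Every full moon, the lighthouse clock ran backwards.",
    "A gardener taught her shadow to prune the roses alone.",
]
print(f"distinct-1: {distinct_n(stories, 1):.2f}")
print(f"distinct-2: {distinct_n(stories, 2):.2f}")
```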
- Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment [4.334576480811837]
We propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing. Our method is especially useful in subjective evaluations where annotators do not all agree with one another.
arXiv Detail & Related papers (2025-10-01T04:29:36Z)
- Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations [48.57816792550401]
We examine creativity measures including the creativity index, perplexity, syntactic templates, and LLM-as-a-Judge. Our analyses reveal that these metrics exhibit limited consistency, each capturing different dimensions of creativity.
arXiv Detail & Related papers (2025-08-07T15:11:48Z)
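Of the metrics that analysis examines, perplexity is the simplest to reproduce. Below is a minimal sketch of perplexity under a causal language model using the Hugging Face transformers API; the gpt2 checkpoint is an arbitrary example choice, not the paper's setup.

```python
# Hedged sketch: perplexity of a story under a causal LM (gpt2 is an
# arbitrary example checkpoint, not necessarily what the paper used).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token-level NLL.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

# Lower perplexity means more predictable text; per the cited analysis,
# this can rank stories differently from, e.g., LLM-as-a-Judge scores,
# which one could check with a rank correlation (scipy.stats.spearmanr).
print(perplexity("The clock in the lighthouse ran backwards every full moon."))
```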
- Creativity in LLM-based Multi-Agent Systems: A Survey [56.25583236738877]
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.
arXiv Detail & Related papers (2025-05-27T12:36:14Z)
- Probing and Inducing Combinational Creativity in Vision-Language Models [52.76981145923602]
Recent advances in Vision-Language Models (VLMs) have sparked debate about whether their outputs reflect combinational creativity. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework.
arXiv Detail & Related papers (2025-04-17T17:38:18Z)
- Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation [5.734448042909701]
Large language models for creative tasks often lack diversity. Common solutions, such as sampling at higher temperatures, can compromise the quality of the results. We propose a context-based score to quantitatively evaluate value and originality.
arXiv Detail & Related papers (2025-02-18T19:00:01Z)
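To give a flavor of what a context-based originality score could look like, the sketch below measures how far a candidate continuation's embedding sits from a set of reference continuations of the same context. This is an assumed construction, not the paper's actual score; the sentence-transformers checkpoint name is an example choice.

```python
# Hedged sketch: originality as embedding distance from reference
# continuations (an assumed construction, not the paper's actual score).
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

def originality(candidate: str, references: list[str]) -> float:
    """1 - max cosine similarity to any reference continuation."""
    cand_emb = model.encode([candidate])
    ref_embs = model.encode(references)
    return 1.0 - cos_sim(cand_emb, ref_embs).max().item()

references = [
    "The knight drew his sword and charged at the dragon.",
    "The knight fled the castle before the dragon woke.",
]
print(originality("The knight offered the dragon a chess match.", references))
```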
- Can AI Be as Creative as Humans? [84.43873277557852]
We prove in theory that AI can be as creative as humans, provided that it can properly fit the data generated by human creators. The debate on AI's creativity thus reduces to the question of its ability to fit a sufficient amount of data.
arXiv Detail & Related papers (2024-01-03T08:49:12Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal. Most automatic evaluation methods like BLEU/ROUGE may not be able to adequately capture these dimensions. We propose a new LLM-based evaluation framework that compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
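As a hedged sketch of the LLM-as-judge pattern that paper describes, the snippet below builds a rubric prompt covering both objective and subjective criteria. `call_llm` is a hypothetical stand-in for whatever chat-completion client is available, and the rubric wording and 1-5 scale are assumptions rather than the paper's actual prompt.

```python
# Hedged sketch of an LLM-as-judge rubric prompt. `call_llm` is a
# hypothetical placeholder for any chat-completion client; the criteria
# and scale are assumptions, not the paper's actual prompt.
CRITERIA = {
    "objective": ["grammar", "factual correctness"],
    "subjective": ["informativeness", "succinctness", "appeal"],
}

def build_judge_prompt(reference: str, candidate: str) -> str:
    lines = ["Rate the candidate summary against the reference on a 1-5 scale."]
    for kind, names in CRITERIA.items():
        lines.append(f"{kind.capitalize()} criteria: {', '.join(names)}.")
    lines += [f"Reference:\n{reference}", f"Candidate:\n{candidate}",
              "Reply with one line per criterion: <criterion>: <score>."]
    return "\n\n".join(lines)

# scores = call_llm(build_judge_prompt(reference_text, candidate_text))
print(build_judge_prompt("Reference summary...", "Candidate summary..."))
```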
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.