Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
- URL: http://arxiv.org/abs/2505.19236v1
- Date: Sun, 25 May 2025 17:25:23 GMT
- Title: Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
- Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song,
- Abstract summary: Creativity evaluation remains a challenging frontier for large language models (LLMs). We propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve consistency. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments.
- Score: 26.89750429841565
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, in this paper, we propose a novel pairwise-comparison framework for assessing textual creativity, leveraging shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human-generated and synthetic data in training highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs. We will release all data, code, and models publicly soon to support further research.
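The pairwise-comparison setup the abstract describes can be illustrated with a short sketch. The prompt wording, the `query_llm` helper, and the function names below are hypothetical placeholders, not the authors' released CrEval implementation; they only show how a shared instruction plus two candidate responses can be reduced to a single preference judgment.

```python
# Minimal sketch of pairwise creativity comparison under a shared instruction.
# `query_llm` is a hypothetical helper standing in for any chat-completion API;
# the prompt template is illustrative, not the CrEval prompt from the paper.

def query_llm(prompt: str) -> str:
    """Placeholder: call an LLM of your choice and return its text response."""
    raise NotImplementedError("plug in your own model call here")

def compare_creativity(instruction: str, response_a: str, response_b: str) -> str:
    """Return 'A', 'B', or 'TIE' for which response is judged more creative.

    Both responses are judged against the same shared instruction, which is
    the consistency mechanism the abstract highlights.
    """
    prompt = (
        "You are judging creativity.\n"
        f"Shared instruction: {instruction}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more creative (more novel and still appropriate) "
        "for this instruction? Answer with exactly one of: A, B, TIE."
    )
    verdict = query_llm(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "TIE"

# Example usage (hypothetical data):
# winner = compare_creativity(
#     "Write a one-line slogan for a library that never closes.",
#     "Books around the clock.",
#     "Where midnight readers meet sunrise stories.",
# )
```

Because both candidates share one instruction and one prompt, score drift across items is reduced; training a dedicated evaluator on many such comparisons (as CrEval does with CreataSet) is the paper's contribution beyond this generic judging loop.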
Related papers
- Creativity in LLM-based Multi-Agent Systems: A Survey [56.25583236738877]
Large language model (LLM)-driven multi-agent systems (MAS) are transforming how humans and AIs collaboratively generate ideas and artifacts. This is the first survey dedicated to creativity in MAS. We focus on text and image generation tasks, and present: (1) a taxonomy of agent proactivity and persona design; (2) an overview of generation techniques, including divergent exploration, iterative refinement, and collaborative synthesis, as well as relevant datasets and evaluation metrics; and (3) a discussion of key challenges, such as inconsistent evaluation standards, insufficient bias mitigation, coordination conflicts, and the lack of unified benchmarks.
arXiv Detail & Related papers (2025-05-27T12:36:14Z)
- Monocle: Hybrid Local-Global In-Context Evaluation for Long-Text Generation with Uncertainty-Based Active Learning [63.531262595858]
A divide-and-conquer approach breaks the comprehensive evaluation task into localized scoring tasks, followed by a final global assessment. We introduce a hybrid in-context learning approach that leverages human annotations to enhance the performance of both local and global evaluations. Finally, we develop an uncertainty-based active learning algorithm that efficiently selects data samples for human annotation.
arXiv Detail & Related papers (2025-05-26T16:39:41Z)
- Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach [32.654673913638426]
We propose an automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as a product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts.
arXiv Detail & Related papers (2025-04-22T10:52:23Z)
- SedarEval: Automated Evaluation using Self-Adaptive Rubrics [4.97150240417381]
We propose a new evaluation paradigm based on self-adaptive rubrics. SedarEval consists of 1,000 meticulously crafted questions, each with its own self-adaptive rubric. We train a specialized evaluator language model (evaluator LM) to supplant human graders.
arXiv Detail & Related papers (2025-01-26T16:45:09Z)
- A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
The Oogiri game is a creativity-driven task requiring humor and associative thinking. LoTbench is an interactive, causality-aware evaluation framework. Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z)
- Optimizing the role of human evaluation in LLM-based spoken document summarization systems [0.0]
We propose an evaluation paradigm for spoken document summarization explicitly tailored for generative AI content.
We provide detailed evaluation criteria and best practices guidelines to ensure robustness in the experimental design, replicability, and trustworthiness of human evaluations.
arXiv Detail & Related papers (2024-10-23T18:37:14Z)
- Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form Text [12.879551933541345]
Large Language Models (LLMs) are capable of generating human-like conversations.
Conventional metrics like BLEU and ROUGE are inadequate for capturing the subtle semantics and contextual richness of such generative outputs.
We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs-as-judges.
arXiv Detail & Related papers (2024-08-17T16:01:45Z)
- Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks [65.69651759036535]
We analyze whether large language models (LLMs) can serve as reliable alternatives to humans. This paper explores both conventional tasks (e.g., story generation) and alignment tasks (e.g., math reasoning). We find that LLM evaluators can generate unnecessary criteria or omit crucial criteria, resulting in a slight deviation from the experts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z)
- ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which comprehensively compares generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)