Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
- URL: http://arxiv.org/abs/2510.05135v1
- Date: Wed, 01 Oct 2025 04:29:36 GMT
- Title: Curiosity-Driven LLM-as-a-judge for Personalized Creative Judgment
- Authors: Vanya Bannihatti Kumar, Divyanshu Goyal, Akhil Eppa, Neel Bhandari,
- Abstract summary: We propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing. Our method is especially useful in subjective evaluations where not all the annotators agree with each other.
- Score: 4.334576480811837
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern large language models (LLMs) excel at objective tasks such as evaluating mathematical reasoning and factual accuracy, yet they falter when faced with the nuanced, subjective nature of assessing creativity. In this work, we propose a novel curiosity-driven LLM-as-a-judge for evaluating creative writing that is personalized to each individual's creative judgments. We use the Torrance Test of Creative Writing (TTCW) benchmark introduced in Chakrabarty et al. (2024), which has stories annotated by expert humans across various subjective dimensions like Originality, to test our hypothesis. We show that our method enables models of various sizes to learn the nuanced creative judgments of different individuals, with improvements over a baseline supervised fine-tuning (SFT) method across evaluation metrics such as Pearson correlation, Cohen's kappa, and F1. Our method is especially useful in subjective evaluations where not all annotators agree with each other.
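The abstract reports per-annotator agreement via Pearson correlation, Cohen's kappa, and F1. The snippet below is a minimal sketch of how such agreement metrics could be computed for a personalized judge against a single annotator, assuming binary (pass/fail) creativity labels; the function name, variable names, and example data are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch: agreement between one judge's labels and one annotator's labels.
# Assumes binary (1 = creative, 0 = not) labels; all data below is hypothetical.
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, f1_score

def agreement_metrics(judge_labels, annotator_labels):
    """Compute Pearson r, Cohen's kappa, and F1 for one judge/annotator pair."""
    pearson_r, _ = pearsonr(judge_labels, annotator_labels)
    kappa = cohen_kappa_score(annotator_labels, judge_labels)
    f1 = f1_score(annotator_labels, judge_labels)
    return {"pearson": pearson_r, "cohen_kappa": kappa, "f1": f1}

# Hypothetical labels for one annotator and one judge on ten stories.
annotator = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
judge     = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(agreement_metrics(judge, annotator))
```

A personalized judge would be evaluated this way separately against each annotator, rather than against a single aggregated label.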
Related papers
- Reward Modeling for Scientific Writing Evaluation [50.33952894976367]
It is critical to develop models that can be reliably deployed for evaluating diverse open-ended scientific writing tasks. We propose cost-efficient, open-source reward models tailored for scientific writing evaluation.
arXiv Detail & Related papers (2026-01-16T15:32:58Z) - Simple Lines, Big Ideas: Towards Interpretable Assessment of Human Creativity from Drawings [18.09092203643732]
We propose a data-driven framework for automatic and interpretable creativity assessment from drawings. Motivated by the cognitive evidence proposed in [6] that creativity can emerge from both what is drawn (content) and how it is drawn (style), we reinterpret the creativity score as a function of these two complementary dimensions.
arXiv Detail & Related papers (2025-11-17T02:16:01Z) - CreativityPrism: A Holistic Benchmark for Large Language Model Creativity [64.18257552903151]
Creativity is often seen as a hallmark of human intelligence, yet there is still no holistic framework to evaluate LLMs' creativity across diverse scenarios. We propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity.
arXiv Detail & Related papers (2025-10-23T00:22:10Z) - Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations [48.57816792550401]
We examine creativity measures including the creativity index, perplexity, syntactic templates, and LLM-as-a-Judge. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity.
arXiv Detail & Related papers (2025-08-07T15:11:48Z) - Automated Creativity Evaluation for Large Language Models: A Reference-Based Approach [32.654673913638426]
We propose an automated evaluation method based on the Torrance Test of Creative Writing (TTCW), which evaluates creativity as product. Our method employs a reference-based Likert-style approach, scoring generated creative texts relative to high-quality reference texts.
arXiv Detail & Related papers (2025-04-22T10:52:23Z) - Leveraging Large Models to Evaluate Novel Content: A Case Study on Advertisement Creativity [25.460598990334077]
We attempt to break down visual advertisement creativity into atypicality and originality. We propose a suite of tasks specifically for such a subjective problem. We also evaluate the alignment between state-of-the-art (SoTA) vision language models (VLMs) and humans on our proposed benchmark.
arXiv Detail & Related papers (2025-02-26T04:28:03Z) - How do Humans and Language Models Reason About Creativity? A Comparative Analysis [12.398832289718703]
We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality.
arXiv Detail & Related papers (2025-02-05T15:08:43Z) - HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Evaluating Large Language Models with Psychometrics [59.821829073478376]
This paper offers a comprehensive benchmark for quantifying psychological constructs of Large Language Models (LLMs). Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - Art or Artifice? Large Language Models and the False Promise of Creativity [53.04834589006685]
We propose the Torrance Test of Creative Writing (TTCW) to evaluate creativity as a product.
TTCW consists of 14 binary tests organized into the original dimensions of Fluency, Flexibility, Originality, and Elaboration.
Our analysis shows that LLM-generated stories pass 3-10X fewer TTCW tests than stories written by professionals (a small tallying sketch follows this entry).
arXiv Detail & Related papers (2023-09-25T22:02:46Z)
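The TTCW protocol above scores each story by which of its 14 binary tests it passes, grouped under Fluency, Flexibility, Originality, and Elaboration. The snippet below is a minimal sketch of that kind of tallying; the test-to-dimension mapping and the pass/fail outcomes are hypothetical placeholders, not the actual TTCW assignment from Chakrabarty et al. (2024).

```python
# Minimal sketch of aggregating TTCW-style binary test outcomes for one story.
# The test-to-dimension mapping and outcomes are hypothetical toy data.
from collections import Counter

DIMENSIONS = ["Fluency", "Flexibility", "Originality", "Elaboration"]

def tally(outcomes, test_to_dimension):
    """outcomes: dict mapping test name -> bool (did the story pass the test)."""
    per_dim = Counter()
    for test, passed in outcomes.items():
        if passed:
            per_dim[test_to_dimension[test]] += 1
    total = sum(per_dim.values())
    return total, dict(per_dim)

# Hypothetical data: 14 tests spread across the four dimensions, toy pass/fail.
test_to_dimension = {f"test_{i}": DIMENSIONS[i % 4] for i in range(14)}
outcomes = {t: (i % 3 != 0) for i, t in enumerate(test_to_dimension)}
total, per_dim = tally(outcomes, test_to_dimension)
print(f"passed {total}/14 tests:", per_dim)
```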