How do Humans and Language Models Reason About Creativity? A Comparative Analysis
- URL: http://arxiv.org/abs/2502.03253v2
- Date: Mon, 05 May 2025 13:47:32 GMT
- Title: How do Humans and Language Models Reason About Creativity? A Comparative Analysis
- Authors: Antonio Laverghetta Jr., Tuhin Chakrabarty, Tom Hope, Jimmy Pronchick, Krupa Bhawsar, Roger E. Beaty,
- Abstract summary: We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality.
- Score: 12.398832289718703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation, using a fine-grained annotation protocol in which raters were tasked with explaining their originality scores and rating the facets of remoteness (whether the response is "far" from everyday ideas), uncommonness (whether the response is rare), and cleverness. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (example) to those who did not (no example). Computational text analysis revealed that, compared to experts with examples, no-example experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted in the semantic similarity of ideas. In the example condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially -- to upwards of $0.99$ -- suggesting a homogenization in the LLMs' evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest diverging preferences for what different populations prioritize when rating.
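The facet-correlation analysis described in the abstract can be illustrated with a minimal sketch. The data and column names below are hypothetical, not the paper's released data or code; the snippet only shows how Pearson correlations between facet ratings and originality would be computed, where near-identical, very high values across facets would correspond to the "homogenization" reported for LLM ratings in the example condition.

```python
import numpy as np

# Hypothetical facet ratings for a handful of responses (not the paper's data).
# Each row: one rated response; columns: originality, remoteness, uncommonness, cleverness.
ratings = np.array([
    [4.0, 3.5, 4.0, 3.0],
    [2.0, 2.5, 2.0, 2.5],
    [5.0, 4.5, 5.0, 4.0],
    [3.0, 3.0, 2.5, 3.5],
    [1.0, 1.5, 1.0, 2.0],
])

facets = ["remoteness", "uncommonness", "cleverness"]
originality = ratings[:, 0]

# Pearson correlation of each facet with the originality score.
for i, name in enumerate(facets, start=1):
    r = np.corrcoef(originality, ratings[:, i])[0, 1]
    print(f"{name:12s} vs originality: r = {r:.2f}")
```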
Related papers
- Empirically evaluating commonsense intelligence in large language models with large-scale human judgments [4.7206754497888035]
We propose a novel method for evaluating common sense in artificial intelligence. We measure the correspondence between a model's judgment and that of a human population. Our framework contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.
arXiv Detail & Related papers (2025-05-15T13:55:27Z) - Quantitative Assessment of Intersectional Empathetic Bias and Understanding [0.0]
A growing amount of literature critiques the current operationalizations of empathy based on loose definitions of the construct.
We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins.
arXiv Detail & Related papers (2024-11-08T18:43:15Z) - Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition with standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z) - Good Idea or Not, Representation of LLM Could Tell [86.36317971482755]
We focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas.
We release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task.
Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs.
arXiv Detail & Related papers (2024-09-07T02:07:22Z) - (De)Noise: Moderating the Inconsistency Between Human Decision-Makers [15.291993233528526]
We study whether algorithmic decision aids can be used to moderate the degree of inconsistency in human decision-making in the context of real estate appraisal.
We find that both (i) asking respondents to review their estimates in a series of algorithmically chosen pairwise comparisons and (ii) providing respondents with traditional machine advice are effective strategies for influencing human responses.
arXiv Detail & Related papers (2024-07-15T20:24:36Z) - Mitigating Cognitive Biases in Multi-Criteria Crowd Assessment [22.540544209683592]
We focus on cognitive biases associated with a multi-criteria assessment in crowdsourcing.
Crowdworkers who rate targets with multiple different criteria simultaneously may provide biased responses due to prominence of some criteria or global impressions of the evaluation targets.
We propose two specific model structures for Bayesian opinion aggregation models that consider inter-criteria relations.
arXiv Detail & Related papers (2024-07-10T16:00:23Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models [53.00812898384698]
We argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking.
We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert.
We propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability.
arXiv Detail & Related papers (2024-05-28T22:45:28Z) - Evaluating Agents using Social Choice Theory [20.58298173034909]
We argue that many general evaluation problems can be viewed through the lens of voting theory. Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation. These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation.
arXiv Detail & Related papers (2023-12-05T20:40:37Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Rethinking and Refining the Distinct Metric [61.213465863627476]
We refine the calculation of distinct scores by re-scaling the number of distinct tokens based on its expectation (a rough sketch of this rescaling idea appears after this list).
We provide both empirical and theoretical evidence to show that our method effectively removes the biases exhibited in the original distinct score.
arXiv Detail & Related papers (2022-02-28T07:36:30Z) - SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
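As a side note on the "Rethinking and Refining the Distinct Metric" entry above, the expectation-based rescaling can be sketched as follows. This is a minimal illustration assuming the expected distinct-token count is computed under uniform sampling from a vocabulary of size `vocab_size`; the function name and exact normalization here are assumptions, so consult the cited paper for the precise formulation.

```python
from collections import Counter

def expectation_adjusted_distinct(tokens, vocab_size):
    """Distinct-token count rescaled by its expectation under uniform sampling.

    Sketch only: the raw number of distinct tokens is divided by the number
    expected if len(tokens) tokens were drawn uniformly from a vocabulary of
    vocab_size types. Details may differ from the cited paper.
    """
    c = len(tokens)
    if c == 0 or vocab_size <= 0:
        return 0.0
    n_distinct = len(Counter(tokens))
    expected_distinct = vocab_size * (1.0 - (1.0 - 1.0 / vocab_size) ** c)
    return n_distinct / expected_distinct

# Example: a repetitive generation scores lower than a fully varied one.
print(expectation_adjusted_distinct("the cat sat on the mat".split(), vocab_size=5000))
print(expectation_adjusted_distinct("a quick brown fox jumps over dogs".split(), vocab_size=5000))
```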