Forma mentis networks predict creativity ratings of short texts via interpretable artificial intelligence in human and GPT-simulated raters
- URL: http://arxiv.org/abs/2412.00530v1
- Date: Sat, 30 Nov 2024 16:33:48 GMT
- Title: Forma mentis networks predict creativity ratings of short texts via interpretable artificial intelligence in human and GPT-simulated raters
- Authors: Edith Haim, Natalie Fischer, Salvatore Citraro, Giulio Rossetti, Massimo Stella
- Abstract summary: Using Explainable Artificial Intelligence (XAI), we test whether features related to Mednick's associative theory of creativity can explain creativity ratings assigned by humans and GPT-3.5.
GPT-3.5 favours 'its own' stories and rates human stories differently from humans.
- Abstract: Creativity is a fundamental skill of human cognition. We use textual forma mentis networks (TFMN) to extract network (semantic/syntactic associations) and emotional features from approximately one thousand human- and GPT-3.5-generated stories. Using Explainable Artificial Intelligence (XAI), we test whether features related to Mednick's associative theory of creativity can explain the creativity ratings assigned by humans and GPT-3.5. Using XGBoost, we examine three scenarios: (i) human ratings of human stories, (ii) GPT-3.5 ratings of human stories, and (iii) GPT-3.5 ratings of GPT-generated stories. Our findings reveal that GPT-3.5 ratings differ significantly from human ratings not only in terms of correlations but also in the feature patterns identified with XAI methods. GPT-3.5 favours 'its own' stories and rates human stories differently from humans. Feature importance analysis with SHAP scores shows that: (i) network features are more predictive of both human creativity ratings and GPT-3.5's ratings of human stories; (ii) emotional features play a greater role than semantic/syntactic network structure when GPT-3.5 rates its own stories. These quantitative results underscore key limitations in GPT-3.5's ability to align with human assessments of creativity. We emphasise the need for caution when using GPT-3.5 to assess and generate creative content, as it does not yet capture the nuanced complexity that characterises human creativity.
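The modelling pipeline described in the abstract (TFMN-derived network and emotional features, an XGBoost regressor, SHAP feature importances) can be sketched roughly as follows. This is a minimal illustration under assumed inputs: the feature names, hyperparameters, and randomly generated data are placeholders, not the authors' actual features, ratings, or code.

```python
# Minimal sketch (not the authors' code): predict creativity ratings from
# hypothetical TFMN-style network and emotional features with XGBoost,
# then rank features by mean absolute SHAP value.
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
n_stories = 1000  # roughly the corpus size mentioned in the abstract

# Placeholder feature table: network features (first two columns) and
# emotional features (last two); real values would come from TFMN extraction.
X = pd.DataFrame({
    "avg_shortest_path": rng.normal(3.0, 0.5, n_stories),
    "clustering_coeff":  rng.uniform(0.0, 0.3, n_stories),
    "valence":           rng.uniform(-1.0, 1.0, n_stories),
    "arousal":           rng.uniform(0.0, 1.0, n_stories),
})
y = rng.uniform(1.0, 5.0, n_stories)  # placeholder creativity ratings

model = xgb.XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.05)
model.fit(X, y)

# SHAP scores: per-feature contribution to each predicted rating.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # shape: (n_stories, n_features)
mean_abs = np.abs(shap_values).mean(axis=0)
for name, score in sorted(zip(X.columns, mean_abs), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

In this setup, comparing the mean absolute SHAP values of network versus emotional features, separately for each rating scenario, mirrors the comparison summarised in points (i) and (ii) of the abstract.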
Related papers
- Human Bias in the Face of AI: The Role of Human Judgement in AI Generated Text Evaluation [48.70176791365903]
This study explores how bias shapes the perception of AI- versus human-generated content.
We investigated how human raters respond to labeled and unlabeled content.
arXiv Detail & Related papers (2024-09-29T04:31:45Z) - Measuring Psychological Depth in Language Models [50.48914935872879]
We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories.
We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (Krippendorff's alpha of 0.72).
Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit.
arXiv Detail & Related papers (2024-06-18T14:51:54Z) - Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit's Showerthoughts [17.369951848952265]
We investigate the ability of LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts.
We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts.
We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts.
arXiv Detail & Related papers (2024-05-02T18:29:58Z) - How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO [55.25989137825992]
We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
arXiv Detail & Related papers (2024-04-22T08:00:51Z) - GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation [93.55550787058012]
This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models.
To this end, we first develop a prompt generator using GPT-4V to generate evaluating prompts.
We then design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria.
arXiv Detail & Related papers (2024-01-08T18:52:09Z) - ChatGPT as a commenter to the news: can LLMs generate human-like
opinions? [3.0309690768567754]
We investigate to what extent GPT-3.5 can generate human-like comments on Dutch news articles.
We analyze human likeness across multiple prompting techniques.
We find that our fine-tuned BERT models can easily distinguish human-written comments from GPT-3.5 generated comments.
arXiv Detail & Related papers (2023-12-21T15:46:36Z) - Inductive reasoning in humans and large language models [0.0]
We apply GPT-3.5 and GPT-4 to a classic problem in human inductive reasoning known as property induction.
Although GPT-3.5 struggles to capture many aspects of human behaviour, GPT-4 is much more successful.
arXiv Detail & Related papers (2023-06-11T00:23:25Z) - To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z) - All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated
Text [46.260544251940125]
We run a study assessing non-experts' ability to distinguish between human- and machine-authored text.
We find that, without training, evaluators distinguished between GPT-3- and human-authored text at random chance level.
arXiv Detail & Related papers (2021-06-30T19:00:25Z)