LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
- URL: http://arxiv.org/abs/2602.16162v1
- Date: Wed, 18 Feb 2026 03:19:12 GMT
- Title: LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers
- Authors: Peiqi Sui et al.
- Abstract summary: We show that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts.
- Score: 1.9036571490366498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We argue that uncertainty is a key and understudied limitation of LLMs' performance in creative writing, whose outputs are often characterized as trite and cliché-ridden. Literary theory identifies uncertainty as a necessary condition for creative expression, while current alignment strategies steer models away from uncertain outputs to ensure factuality and reduce hallucination. We formalize this tension by quantifying the "uncertainty gap" between human-authored stories and model-generated continuations. Through a controlled information-theoretic analysis of 28 LLMs on high-quality storytelling datasets, we demonstrate that human writing consistently exhibits significantly higher uncertainty than model outputs. We find that instruction-tuned and reasoning models exacerbate this trend compared to their base counterparts; furthermore, the gap is more pronounced in creative writing than in functional domains, and correlates strongly with writing quality. Achieving human-level creativity requires new uncertainty-aware alignment paradigms that can distinguish between destructive hallucinations and the constructive ambiguity required for literary richness.
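The abstract does not spell out the measurement protocol, but the idea of an information-theoretic "uncertainty gap" can be illustrated with a minimal sketch: score a human-authored and a model-generated continuation of the same prompt by their mean next-token entropy under a reference language model, then take the difference. Everything below (the gpt2 scorer, the helper name mean_token_entropy, the toy strings) is an illustrative assumption, not the authors' actual setup.

```python
# Minimal sketch of an "uncertainty gap" measurement, assuming mean
# next-token entropy under a small reference LM (gpt2) stands in for
# the paper's uncertainty measure. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_token_entropy(prompt: str, continuation: str) -> float:
    """Mean entropy (nats) of the model's next-token distribution over
    the positions that generate the continuation, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab)
    # The distribution at position i predicts token i+1, so the
    # continuation's tokens are predicted by positions
    # [len(prompt)-1, len(full)-2].
    start = prompt_ids.shape[1] - 1
    cont_logits = logits[0, start : full_ids.shape[1] - 1]
    probs = torch.softmax(cont_logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
    return entropy.mean().item()

# Toy comparison; continuations start with a space so the prompt's
# tokenization is unchanged when the strings are concatenated.
prompt = "The lighthouse keeper had not spoken to anyone in years."
human = " The silence had a texture now, like wet wool."
machine = " He was very lonely and sad every single day."
gap = mean_token_entropy(prompt, human) - mean_token_entropy(prompt, machine)
print(f"uncertainty gap (human - model): {gap:.3f} nats")
```

On this reading, a positive gap means the human continuation keeps the reference model more uncertain token by token, which is the direction the paper reports.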
Related papers
- Beyond Divergent Creativity: A Human-Based Evaluation of Creativity in Large Language Models [6.036586911740041]
Large language models (LLMs) are increasingly used in verbal creative tasks. The widely used Divergent Association Task (DAT) focuses on novelty, ignoring appropriateness. We evaluate a range of state-of-the-art LLMs on DAT and show that their scores on the task are lower than those of two baselines that do not possess any creative abilities.
arXiv Detail & Related papers (2026-01-28T12:41:32Z)
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed "textual inertia", where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in fewer than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- CreativityPrism: A Holistic Benchmark for Large Language Model Creativity [64.18257552903151]
Creativity is often seen as a hallmark of human intelligence, yet there is still no holistic framework to evaluate LLMs' creativity across diverse scenarios. We propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity.
arXiv Detail & Related papers (2025-10-23T00:22:10Z)
- Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity [29.58419742230708]
N-gram novelty is widely used to evaluate language models' ability to generate text outside of their training data. We investigate the relationship between this notion of creativity and n-gram novelty through close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with creativity as judged by expert writers, 91% of top-quartile expressions by n-gram novelty are not judged as creative. (A minimal sketch of one formulation of the n-gram novelty metric appears after this list.)
arXiv Detail & Related papers (2025-09-26T17:59:05Z)
- MORABLES: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables [50.29407048003165]
We present MORABLES, a human-verified benchmark built from fables and short stories drawn from historical literature. The main task is structured as multiple-choice questions targeting moral inference, with carefully crafted distractors that challenge models to go beyond shallow, extractive question answering. Our findings show that, while larger models outperform smaller ones, they remain susceptible to adversarial manipulation and often rely on superficial patterns rather than true moral reasoning.
arXiv Detail & Related papers (2025-09-15T19:06:10Z)
- Rethinking Creativity Evaluation: A Critical Analysis of Existing Creativity Evaluations [48.57816792550401]
We examine creativity measures including the creativity index, perplexity, syntactic templates, and LLM-as-a-Judge. Our analyses reveal that these metrics exhibit limited consistency, capturing different dimensions of creativity.
arXiv Detail & Related papers (2025-08-07T15:11:48Z)
- Investigating Factuality in Long-Form Text Generation: The Roles of Self-Known and Self-Unknown [68.33486915047014]
We investigate the factuality of long-form text generation across various large language models (LLMs). Our analysis reveals that factuality tends to decline in later sentences of the generated text, accompanied by a rise in the number of unsupported claims.
arXiv Detail & Related papers (2024-11-24T22:06:26Z)
- Creativity Has Left the Chat: The Price of Debiasing Language Models [1.223779595809275]
We investigate the unintended consequences of Reinforcement Learning from Human Feedback on the creativity of Large Language Models (LLMs).
Our findings have significant implications for marketers who rely on LLMs for creative tasks such as copywriting, ad creation, and customer persona generation.
arXiv Detail & Related papers (2024-06-08T22:14:51Z)
- Divergent Creativity in Humans and Large Language Models [37.67363469600804]
Large Language Models (LLMs) have led to claims that they are approaching a level of creativity akin to human capabilities. We leverage recent advances in computational creativity to analyze semantic divergence in both state-of-the-art LLMs and a dataset of 100,000 humans. We find evidence that LLMs can surpass average human performance on the Divergent Association Task and approach human creative writing abilities.
arXiv Detail & Related papers (2024-05-13T22:37:52Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
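As flagged in the n-gram novelty entry above, one common formulation of that metric is the fraction of a text's n-grams that never occur in a reference corpus. The exact definition used in that paper may differ; the helper names and toy strings below are illustrative assumptions.

```python
# Minimal sketch of n-gram novelty: the share of a text's n-grams that
# are absent from a reference corpus. The critiqued paper may use a
# different exact formulation; this variant is illustrative only.
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All n-grams of a token sequence, as a set (duplicates collapsed)."""
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def ngram_novelty(text: str, corpus: str, n: int = 4) -> float:
    """Fraction of the text's n-grams not found in the corpus."""
    text_grams = ngrams(text.split(), n)
    if not text_grams:
        return 0.0
    corpus_grams = ngrams(corpus.split(), n)
    novel = sum(1 for g in text_grams if g not in corpus_grams)
    return novel / len(text_grams)

corpus = "the quick brown fox jumps over the lazy dog"
print(ngram_novelty("the quick brown fox sings opera tonight", corpus, n=3))
# -> 0.6: three of the five trigrams never appear in the corpus
```

The critique above is precisely that a high score here need not track judged creativity: unusual word salad maximizes the metric as easily as a striking image does.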
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.