Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?
- URL: http://arxiv.org/abs/2602.02290v1
- Date: Mon, 02 Feb 2026 16:29:32 GMT
- Title: Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?
- Authors: Alex Argese, Pasquale Lisena, Raphaël Troncy
- Abstract summary: We propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection.
- Score: 0.5349058473848842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that standard summarization metrics rarely capture well. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how that content is narrated and controlled.
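The abstract does not spell out how the six components are aggregated; the following is a minimal sketch, assuming a simple weighted average over hypothetical, [0, 1]-normalized component scores:

```python
from dataclasses import dataclass

@dataclass
class StoryComponents:
    # Hypothetical component scores, each normalized to [0, 1].
    semantic_alignment: float    # meaning preserved from the source article
    lexical_grounding: float     # key source terminology retained
    narrative_control: float     # adherence to the requested narrative style
    structural_fidelity: float   # ordering of ideas follows the source
    redundancy_avoidance: float  # 1 - fraction of repeated content
    entity_faithfulness: float   # 1 - rate of hallucinated entities

def story_score(c: StoryComponents, weights=None) -> float:
    """Weighted average of the six components (assumed aggregation;
    the paper defines the actual StoryScore formula)."""
    values = [c.semantic_alignment, c.lexical_grounding, c.narrative_control,
              c.structural_fidelity, c.redundancy_avoidance, c.entity_faithfulness]
    weights = weights or [1.0] * len(values)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(story_score(StoryComponents(0.9, 0.8, 0.7, 0.85, 0.95, 1.0)))  # ~0.867
```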
Related papers
- The Art of Generative Narrativity [0.0]
Generative AI leads to experiments with non-verbal forms that have the potential to incite narratives through the audience's experience. In five central sections, we discuss interrelated exemplars whose conceptual frameworks anticipate or underscore the issues of contemporary linguistic automation. In closing sections, we summarize the expressive features of these exemplars and underline their value for critically assessing generative AI's cultural influence and fallouts.
arXiv Detail & Related papers (2026-03-01T12:58:24Z)
- NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control [59.6128550986024]
NarraScore is a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. It employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism, and achieves state-of-the-art consistency and narrative alignment with negligible computational overhead.
arXiv Detail & Related papers (2026-02-09T09:39:42Z)
- Incentives or Ontology? A Structural Rebuttal to OpenAI's Hallucination Thesis [0.42970700836450487]
We argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules. We conclude that hallucination is a structural property of generative architectures.
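As a rough illustration of what an external truth-validation and abstention module could look like, here is a minimal wrapper; the `generate` and `validate` callbacks are hypothetical stand-ins, not the paper's components:

```python
from typing import Callable

ABSTAIN = "I don't know."

def guarded_generate(
    generate: Callable[[str], str],        # hypothetical model call
    validate: Callable[[str, str], bool],  # (prompt, answer) -> supported?
    prompt: str,
    max_attempts: int = 3,
) -> str:
    """Query the model, check each answer with an external validator, and
    abstain if no attempt passes (assumed design, not the paper's modules)."""
    for _ in range(max_attempts):
        answer = generate(prompt)
        if validate(prompt, answer):
            return answer
    return ABSTAIN
```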
arXiv Detail & Related papers (2025-12-16T17:39:45Z)
- How Large Language Models are Designed to Hallucinate [0.42970700836450487]
We argue that hallucination is a structural outcome of the transformer architecture. Our contribution is threefold: (1) a comparative account showing why existing explanations are insufficient; (2) a predictive taxonomy of hallucination linked to existential structures with proposed benchmarks; and (3) design directions toward "truth-constrained" architectures capable of withholding or deferring when disclosure is absent.
arXiv Detail & Related papers (2025-09-19T16:46:27Z)
- HalluLens: LLM Hallucination Benchmark [49.170128733508335]
Large language models (LLMs) often generate responses that deviate from user input or training data, a phenomenon known as "hallucination". This paper introduces a comprehensive hallucination benchmark, incorporating both new extrinsic and existing intrinsic evaluation tasks.
arXiv Detail & Related papers (2025-04-24T13:40:27Z)
- Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations [82.42811602081692]
This paper introduces a subsequence association framework to systematically trace and understand hallucinations. The key insight is that hallucinations arise when dominant hallucinatory associations outweigh faithful ones. We propose a tracing algorithm that identifies causal subsequences by analyzing hallucination probabilities across randomized input contexts.
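A minimal sketch of the randomized-context idea, with hypothetical `generate` and `contains_hallucination` callbacks standing in for the model and the hallucination checker (the paper's actual tracing algorithm is more involved):

```python
import random
from typing import Callable, List

def hallucination_rate(
    generate: Callable[[str], str],                 # hypothetical model call
    contains_hallucination: Callable[[str], bool],  # hypothetical checker
    subsequence: str,
    filler_sentences: List[str],
    n_trials: int = 50,
) -> float:
    """Estimate how often a fixed subsequence triggers hallucination when
    embedded in randomized contexts (a simplified reading of the abstract,
    not the paper's algorithm)."""
    hits = 0
    for _ in range(n_trials):
        # Surround the candidate subsequence with randomly sampled context.
        context = random.sample(filler_sentences, k=min(3, len(filler_sentences)))
        prompt = " ".join(context + [subsequence])
        if contains_hallucination(generate(prompt)):
            hits += 1
    return hits / n_trials
```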
arXiv Detail & Related papers (2025-04-17T06:34:45Z)
- A Cause-Effect Look at Alleviating Hallucination of Knowledge-grounded Dialogue Generation [51.53917938874146]
We propose a possible solution for alleviating hallucination in knowledge-grounded dialogue generation (KGD) by exploiting the dialogue-knowledge interaction.
Experimental results of our example implementation show that this method can reduce hallucination without degrading other aspects of dialogue performance.
arXiv Detail & Related papers (2024-04-04T14:45:26Z)
- Does the Generator Mind its Contexts? An Analysis of Generative Model Faithfulness under Context Transfer [42.081311699224585]
The present study introduces the knowledge-augmented generator, which is specifically designed to produce information that remains grounded in contextual knowledge.
Our objective is to explore the existence of hallucinations arising from parametric memory when contextual knowledge undergoes changes.
arXiv Detail & Related papers (2024-02-22T12:26:07Z)
- Do Androids Know They're Only Dreaming of Electric Sheep? [45.513432353811474]
We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior.
Our probes are narrowly trained and we find that they are sensitive to their training domain.
We find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available.
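A minimal sketch of such a probe, assuming hidden states have already been extracted from the language model and paired with hallucination labels (toy random data below, not the paper's setup):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-ins: each row is a hidden state for one generated token
# (d = 768); each label marks whether that token was hallucinated.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(1000, 768))
labels = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0)

# A linear probe: logistic regression over frozen model representations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```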
arXiv Detail & Related papers (2023-12-28T18:59:50Z)
- DeltaScore: Fine-Grained Story Evaluation with Perturbations [69.33536214124878]
We introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects.
Our central proposition is that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations.
We measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models.
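A minimal sketch of this likelihood-difference idea using GPT-2 as the pre-trained language model (an illustration of the principle, not the paper's exact formulation):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def log_likelihood(text: str) -> float:
    """Total log-likelihood of `text` under GPT-2; `loss` is the mean
    negative log-likelihood over the predicted tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.size(1) - 1)

def delta_score(original: str, perturbed: str) -> float:
    """Likelihood drop after a perturbation; a larger drop suggests the
    story scored well on the perturbed aspect (sketch of the idea only)."""
    return log_likelihood(original) - log_likelihood(perturbed)

# Example: a fluency perturbation that scrambles word order.
print(delta_score("The knight rode into the quiet village.",
                  "village quiet the rode into knight The."))
```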
arXiv Detail & Related papers (2023-03-15T23:45:54Z)
- Inspecting the Factuality of Hallucinated Entities in Abstractive Summarization [36.052622624166894]
State-of-the-art abstractive summarization systems often generate hallucinations, i.e., content that is not directly inferable from the source text.
We propose a novel detection approach that separates factual from non-factual hallucinations of entities.
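A toy sketch of such an entity-level split, using a crude capitalized-span heuristic in place of a real NER model and a hypothetical `known_facts` set in place of a knowledge source:

```python
import re

def naive_entities(text: str) -> set:
    """Crude entity extractor: spans of capitalized words. A real system
    would use a trained NER model instead."""
    return set(re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text))

def split_entity_hallucinations(source: str, summary: str, known_facts: set):
    """Sort summary entities into grounded, factual-but-novel, and
    non-factual buckets (illustrative pipeline, not the paper's method)."""
    source_ents = naive_entities(source)
    grounded, factual, non_factual = set(), set(), set()
    for ent in naive_entities(summary):
        if ent in source_ents:
            grounded.add(ent)     # directly inferable from the source
        elif ent in known_facts:
            factual.add(ent)      # hallucinated but externally verifiable
        else:
            non_factual.add(ent)  # hallucinated and unsupported
    return grounded, factual, non_factual

src = "Marie Curie won the Nobel Prize in Physics in 1903."
summ = "Marie Curie, born in Warsaw, won the Nobel Prize."
print(split_entity_hallucinations(src, summ, known_facts={"Warsaw"}))
```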
arXiv Detail & Related papers (2021-08-30T15:40:52Z)
- Tortured phrases: A dubious writing style emerging in science. Evidence of critical issues affecting established journals [69.76097138157816]
Probabilistic text generators have been used to produce fake scientific papers for more than a decade.
Complex AI-powered generation techniques produce texts indistinguishable from those written by humans.
Some websites offer to rewrite texts for free, generating gobbledegook full of tortured phrases.
arXiv Detail & Related papers (2021-07-12T20:47:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.