Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs
- URL: http://arxiv.org/abs/2501.00273v1
- Date: Tue, 31 Dec 2024 04:54:48 GMT
- Title: Echoes in AI: Quantifying Lack of Plot Diversity in LLM Outputs
- Authors: Weijia Xu, Nebojsa Jojic, Sudha Rao, Chris Brockett, Bill Dolan
- Abstract summary: We examine two state-of-the-art large language models (LLMs) on story generation.
We find that LLM-generated stories often consist of plot elements that are echoed across a number of generations.
We introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines.
- Score: 26.682827310724363
- Abstract: With rapid advances in large language models (LLMs), there has been an increasing application of LLMs in creative content ideation and generation. A critical question emerges: can current LLMs provide ideas that are diverse enough to truly bolster the collective creativity? We examine two state-of-the-art LLMs, GPT-4 and LLaMA-3, on story generation and discover that LLM-generated stories often consist of plot elements that are echoed across a number of generations. To quantify this phenomenon, we introduce the Sui Generis score, which estimates how unlikely a plot element is to appear in alternative storylines generated by the same LLM. Evaluating on 100 short stories, we find that LLM-generated stories often contain combinations of idiosyncratic plot elements echoed frequently across generations, while the original human-written stories are rarely recreated or even echoed in pieces. Moreover, our human evaluation shows that the ranking of Sui Generis scores among story segments correlates moderately with human judgment of surprise level, even though score computation is completely automatic without relying on human judgment.
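The abstract does not spell out how the Sui Generis score is computed, only that it estimates how unlikely a plot element is to be echoed in alternative storylines generated by the same LLM. A minimal sketch of that idea, assuming an embedding-similarity proxy for "echoing" (the function name `sui_generis_score`, the `sentence-transformers` model choice, and the similarity threshold are illustrative assumptions, not the paper's estimator), might look like this:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

_model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def _embed(texts):
    """Return L2-normalized sentence embeddings for a list of strings."""
    vecs = np.asarray(_model.encode(texts))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def sui_generis_score(segment, alternative_stories, sim_threshold=0.8):
    """Rough proxy for the Sui Generis idea: how unlikely is `segment`
    to appear in alternative storylines generated by the same LLM?

    `alternative_stories` is a list of stories, each a list of segment strings.
    A segment echoed in many alternatives gets a low score; one that never
    reappears gets a high score. This is NOT the paper's exact estimator.
    """
    seg_vec = _embed([segment])[0]
    echoed = 0
    for story in alternative_stories:
        alt_vecs = _embed(story)
        # Count the segment as "echoed" if any segment of this alternative
        # story is sufficiently similar to it.
        if float((alt_vecs @ seg_vec).max()) >= sim_threshold:
            echoed += 1
    n = len(alternative_stories)
    p_echo = (echoed + 1) / (n + 2)   # Laplace smoothing avoids log(0)
    return -np.log(p_echo)            # rarer (more sui generis) => higher score
```

Under this reading, segments of a story can be ranked by score and compared against human judgments of surprise, mirroring the paper's automatic-versus-human correlation analysis.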
Related papers
- We're Different, We're the Same: Creative Homogeneity Across LLMs [6.532204241949196]
Large language models (LLMs) are now available for use as writing support tools, idea generators, and beyond.
Several works have shown that using an LLM as a creative partner results in a narrower set of creative outputs.
arXiv Detail & Related papers (2025-01-31T18:12:41Z) - A Causality-aware Paradigm for Evaluating Creativity of Multimodal Large Language Models [100.16387798660833]
Oogiri game is a creativity-driven task requiring humor and associative thinking.
LoTbench is an interactive, causality-aware evaluation framework.
Results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable.
arXiv Detail & Related papers (2025-01-25T09:11:15Z) - LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models [96.64960606650115]
LongHalQA is an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts.
LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios.
arXiv Detail & Related papers (2024-10-13T18:59:58Z) - Assessing Language Models' Worldview for Fiction Generation [0.0]
This study investigates the ability of Large Language Models to maintain a state of the world essential to generating fiction.
We find that only two models exhibit a consistent worldview, while the rest are self-conflicting.
This uniformity across models further suggests a lack of the 'state' necessary for fiction.
arXiv Detail & Related papers (2024-08-15T03:19:41Z) - Are Large Language Models Capable of Generating Human-Level Narratives? [114.34140090869175]
This paper investigates the capability of LLMs in storytelling, focusing on narrative development and plot progression.
We introduce a novel computational framework to analyze narratives through three discourse-level aspects.
We show that explicit integration of discourse features can enhance storytelling, as demonstrated by an improvement of over 40% in neural storytelling.
arXiv Detail & Related papers (2024-07-18T08:02:49Z) - The Unlikely Duel: Evaluating Creative Writing in LLMs through a Unique Scenario [12.852843553759744]
We evaluate recent state-of-the-art, instruction-tuned large language models (LLMs) on an English creative writing task.
We use a specifically-tailored prompt (based on an epic combat between Ignatius J. Reilly and a pterodactyl) to minimize the risk of training data leakage.
Evaluation is performed by humans using a detailed rubric covering aspects such as fluency, style, originality, and humor.
arXiv Detail & Related papers (2024-06-22T17:01:59Z) - Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation [15.718288693929019]
Large Language Models (LLM) achieve state-of-the-art performance on many NLP tasks.
We study whether LLMs can be used as substitutes for human annotators.
We find that LLMs outperform current automatic measures for system-level evaluation but still struggle to provide satisfactory explanations.
arXiv Detail & Related papers (2024-05-22T15:56:52Z) - Rethinking Interpretability in the Era of Large Language Models [76.1947554386879]
Large language models (LLMs) have demonstrated remarkable capabilities across a wide array of tasks.
The capability to explain in natural language allows LLMs to expand the scale and complexity of patterns that can be given to a human.
These new capabilities raise new challenges, such as hallucinated explanations and immense computational costs.
arXiv Detail & Related papers (2024-01-30T17:38:54Z) - A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing [0.0]
We evaluate recent LLMs on English creative writing, a challenging and complex task that requires imagination, coherence, and style.
We ask several LLMs and humans to write a story on this task and conduct a human evaluation involving various criteria such as originality, humor, and style.
Our results show that some state-of-the-art commercial LLMs match or slightly outperform our writers in most dimensions; whereas open-source LLMs lag behind.
arXiv Detail & Related papers (2023-10-12T15:56:24Z) - Summarization is (Almost) Dead [49.360752383801305]
We develop new datasets and conduct human evaluation experiments to evaluate the zero-shot generation capability of large language models (LLMs).
Our findings indicate a clear preference among human evaluators for LLM-generated summaries over human-written summaries and summaries generated by fine-tuned models.
arXiv Detail & Related papers (2023-09-18T08:13:01Z) - DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
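The DoLa entry above describes the layer contrast only at a high level. A minimal sketch of that contrast for a single next-token step, assuming a Hugging Face GPT-2 checkpoint and a fixed premature layer (the actual method selects the premature layer dynamically and applies an adaptive plausibility constraint, both omitted here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("Plot twists in LLM stories are", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

# Mature distribution: ordinary next-token logits from the final layer.
mature_logp = out.logits[:, -1].log_softmax(dim=-1)

# Premature distribution: early-exit an intermediate layer through the final
# layer norm and the (tied) output projection. Layer index 8 is an arbitrary choice.
early_hidden = out.hidden_states[8][:, -1]
lm_head = model.get_output_embeddings()
premature_logp = lm_head(model.transformer.ln_f(early_hidden)).log_softmax(dim=-1)

# DoLa-style contrast: tokens whose log-probability grows between the premature
# and mature layers are boosted; the next token is picked from the contrasted scores.
contrast = mature_logp - premature_logp
next_id = contrast.argmax(dim=-1)
print(tok.decode(next_id))
```

Without the plausibility constraint, the raw contrast can favor rare tokens, so this snippet should be read as an illustration of the contrasting step rather than a faithful reimplementation.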