Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
- URL: http://arxiv.org/abs/2512.07777v1
- Date: Mon, 08 Dec 2025 17:58:43 GMT
- Title: Mary, the Cheeseburger-Eating Vegetarian: Do LLMs Recognize Incoherence in Narratives?
- Authors: Karin de Langis, Püren Öncel, Ryan Peters, Andrew Elfenbein, Laura Kristen Allen, Andreas Schramm, Dongyeop Kang
- Abstract summary: We investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives.
- Score: 16.08138269588599
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging a dataset of paired narratives, we investigate the extent to which large language models (LLMs) can reliably separate incoherent and coherent stories. A probing study finds that LLMs' internal representations can reliably identify incoherent narratives. However, LLMs generate responses to rating questions that fail to satisfactorily separate the coherent and incoherent narratives across several prompt variations, hinting at a gap in LLMs' understanding of storytelling. The reasoning LLMs tested do not eliminate these deficits, indicating that thought strings may not be able to fully address the discrepancy between model internal state and behavior. Additionally, we find that LLMs appear to be more sensitive to incoherence resulting from an event that violates the setting (e.g., a rainy day in the desert) than to incoherence arising from a character violating an established trait (e.g., Mary, a vegetarian, later orders a cheeseburger), suggesting that LLMs may rely more on prototypical world knowledge than on building meaning-based narrative coherence. The consistent asymmetry found in our results suggests that LLMs do not have a complete grasp of narrative coherence.
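As a loose illustration of the probing setup described in the abstract, one might extract hidden states for paired coherent/incoherent narratives and fit a linear probe on them. The model choice, layer, mean-pooling, and toy narrative pairs below are assumptions made for the sketch, not the paper's actual configuration.

```python
# Minimal probing sketch: embed paired narratives with a small LM and train a
# linear classifier to separate coherent (0) from incoherent (1) stories.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model, not the paper's
model = AutoModel.from_pretrained("gpt2")
model.eval()

def narrative_embedding(text: str, layer: int = -1) -> torch.Tensor:
    """Mean-pool one layer's hidden states over all tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

# Hypothetical (coherent, incoherent) pairs echoing the abstract's examples.
pairs = [
    ("Mary is a vegetarian. At dinner she orders a salad.",
     "Mary is a vegetarian. At dinner she orders a cheeseburger."),
    ("The story is set in a desert. The sun beats down all day.",
     "The story is set in a desert. Rain falls steadily all day."),
]
X = torch.stack([narrative_embedding(t) for pair in pairs for t in pair]).numpy()
y = [0, 1] * len(pairs)  # 0 = coherent, 1 = incoherent

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("linear probe training accuracy:", probe.score(X, y))
```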
Related papers
- Critical Confabulation: Can LLMs Hallucinate for Social Good? [4.013184717814947]
We propose critical confabulation to fill in omissions in archives that stem from social and political inequality. We reconstruct divergent yet evidence-bound narratives for history's "hidden figures". Our findings validate LLMs' foundational narrative understanding capabilities to perform critical confabulation.
arXiv Detail & Related papers (2025-11-11T01:02:35Z) - LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation [110.610512800947]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
arXiv Detail & Related papers (2025-10-13T12:57:45Z) - Large Language Models Do NOT Really Know What They Don't Know [37.641827402866845]
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations. LLMs can also produce factual errors by relying on shortcuts or spurious associations.
arXiv Detail & Related papers (2025-10-10T06:09:04Z) - Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts [79.1081247754018]
Large Language Models (LLMs) are widely deployed in reasoning, planning, and decision-making tasks. We propose a framework based on Contact Searching Questions (CSQ) to quantify the likelihood of deception.
arXiv Detail & Related papers (2025-08-08T14:46:35Z) - SelfReflect: Can LLMs Communicate Their Internal Answer Distribution? [21.270758668026023]
We develop the SelfReflect metric, an information-theoretic distance between a summary and a distribution over answers. We find that SelfReflect detects even slight deviations, yielding a fine-grained measure of faithfulness between a summary string and an LLM's actual internal distribution over answers.
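As a purely illustrative stand-in for the kind of information-theoretic distance named above (not the paper's actual SelfReflect definition), one could compare an LLM's empirical answer distribution against the distribution a summary string implies, e.g. via KL divergence; all values below are toy numbers.

```python
# Toy comparison of an empirical answer distribution with a summary-implied
# one. NOT the SelfReflect metric itself; an assumed KL-based stand-in.
from collections import Counter
import math

def kl_divergence(p: dict, q: dict, eps: float = 1e-9) -> float:
    """KL(p || q) over the union of answer strings, smoothed by eps."""
    support = set(p) | set(q)
    return sum(p.get(a, eps) * math.log(p.get(a, eps) / q.get(a, eps))
               for a in support)

# Hypothetical answers sampled from the LLM for one question.
samples = ["Paris", "Paris", "Paris", "Lyon"]
p_model = {a: c / len(samples) for a, c in Counter(samples).items()}

# Hypothetical distribution implied by a summary such as
# "almost certainly Paris, possibly Lyon" (in practice scored by the LLM).
p_summary = {"Paris": 0.9, "Lyon": 0.1}

print("summary-to-distribution distance:", kl_divergence(p_model, p_summary))
```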
arXiv Detail & Related papers (2025-05-26T17:59:53Z) - A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection. BTProp builds a belief tree of logically related statements by decomposing a parent statement into child statements. Our method improves over baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
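A minimal sketch of the decomposition-and-propagation idea described above, assuming hypothetical `decompose` and `believe` stand-ins for LLM calls and a simple mean-based aggregation; the paper's actual probabilistic propagation may differ.

```python
# Rough belief-tree sketch: decompose a parent statement into children and
# aggregate belief scores up the tree. All scoring here is placeholder logic.
from dataclasses import dataclass, field

@dataclass
class BeliefNode:
    statement: str
    children: list = field(default_factory=list)

def decompose(statement: str) -> list[str]:
    """Hypothetical LLM call returning logically related sub-statements."""
    return []  # leaf in this toy sketch

def believe(statement: str) -> float:
    """Hypothetical LLM call returning a belief score in [0, 1]."""
    return 0.5

def build_tree(statement: str, depth: int = 2) -> BeliefNode:
    node = BeliefNode(statement)
    if depth > 0:
        node.children = [build_tree(s, depth - 1) for s in decompose(statement)]
    return node

def propagate(node: BeliefNode) -> float:
    """Combine a node's own belief with its children's propagated beliefs."""
    own = believe(node.statement)
    if not node.children:
        return own
    child_mean = sum(propagate(c) for c in node.children) / len(node.children)
    return (own + child_mean) / 2  # assumed aggregation rule

root = build_tree("Mary, a vegetarian, ordered a cheeseburger.")
print("hallucination score (1 - belief):", 1 - propagate(root))
```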
arXiv Detail & Related papers (2024-06-11T05:21:37Z) - One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations [47.669923625184644]
Large Language Models (LLMs) are nondeterministic: the same input can generate different outputs, some of which may be incorrect or hallucinated.
This study investigates how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs.
arXiv Detail & Related papers (2024-05-09T07:12:45Z) - "Sorry, Come Again?" Prompting -- Enhancing Comprehension and Diminishing Hallucination with [PAUSE]-injected Optimal Paraphrasing [10.20632187568563]
Hallucination has emerged as one of the most pressing vulnerabilities of contemporary Large Language Models (LLMs).
In this paper, we introduce Sorry, Come Again (SCA) prompting, aimed at avoiding LLM hallucinations.
We provide an in-depth analysis of linguistic nuances: formality, readability, and concreteness of prompts for 21 LLMs.
We propose an optimal paraphrasing technique to identify the most comprehensible paraphrase of a given prompt.
arXiv Detail & Related papers (2024-03-27T19:45:09Z) - AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for prompting Large Language Models with native-speaking demonstrations.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z) - Are Large Language Models Temporally Grounded? [38.481606493496514]
We provide large language models (LLMs) with textual narratives.
We probe them with respect to their common-sense knowledge of the structure and duration of events.
We evaluate state-of-the-art LLMs on three tasks reflecting these abilities.
arXiv Detail & Related papers (2023-11-14T18:57:15Z) - Improving Factual Consistency of News Summarization by Contrastive Preference Optimization [65.11227166319546]
Large language models (LLMs) can generate summaries that are factually inconsistent with the original articles. These hallucinations are challenging to detect through traditional methods. We propose Contrastive Preference Optimization (CPO) to disentangle the LLMs' propensities to generate faithful and fake content.
arXiv Detail & Related papers (2023-10-30T08:40:16Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
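A toy sketch of the consistency check implied by the question above: ask the same factoid under several prompt variants and count agreement with the gold answer. Here `ask_llm` is a hypothetical stand-in for a real model call, and KaRR's actual statistic is more involved than this simple ratio.

```python
# Toy prompt-variation consistency check; NOT the KaRR statistic itself.
def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API or local model."""
    return "Paris"

prompts = [
    "What is the capital of France?",
    "France's capital city is called what?",
    "Name the capital of France.",
]
gold = "Paris"
hits = sum(ask_llm(p).strip() == gold for p in prompts)
print(f"consistency across prompt variants: {hits}/{len(prompts)}")
```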
arXiv Detail & Related papers (2023-05-17T18:54:37Z) - Event knowledge in large language models: the gap between the impossible and the unlikely [46.540380831486125]
We show that pre-trained large language models (LLMs) possess substantial event knowledge.
They almost always assign higher likelihood to possible vs. impossible events.
However, they show less consistent preferences for likely vs. unlikely events.
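A minimal sketch of likelihood-based plausibility scoring in this spirit: compare a causal LM's log-probability for a possible versus an impossible event. The model and sentence pair are stand-in assumptions, not the paper's materials.

```python
# Score whole sentences by total token log-probability under a causal LM and
# check that the possible event outscores the impossible one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(text: str) -> float:
    """Total log-probability of the sentence's predicted tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids yields mean cross-entropy over the seq_len - 1 predictions
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

possible = "The teacher bought the laptop."
impossible = "The laptop bought the teacher."
print("possible > impossible:",
      sentence_logprob(possible) > sentence_logprob(impossible))
```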
arXiv Detail & Related papers (2022-12-02T23:43:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.