The Imperfective Paradox in Large Language Models
- URL: http://arxiv.org/abs/2601.09373v1
- Date: Wed, 14 Jan 2026 10:57:16 GMT
- Title: The Imperfective Paradox in Large Language Models
- Authors: Bolei Ma, Yusuke Miyao
- Abstract summary: We investigate the Imperfective Paradox, where the past progressive aspect entails event realization for activities but not for accomplishments. We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. We uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation.
- Score: 19.058068907991277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon where the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompting-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.
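As a concrete illustration of the premise-hypothesis contrast the abstract describes, the sketch below builds minimal NLI-style items for an activity versus an accomplishment and computes a simple "hallucinated completion" rate. The item schema, field names, and the metric are illustrative assumptions, not the actual ImperfectiveNLI format or the authors' evaluation code.

```python
# Hypothetical NLI items for the imperfective paradox (illustrative only;
# not the real ImperfectiveNLI schema).
ITEMS = [
    {
        # Activity: past progressive entails event realization.
        "premise": "Mary was running in the park.",
        "hypothesis": "Mary ran in the park.",
        "gold_label": "entailment",
        "event_class": "activity",
    },
    {
        # Accomplishment: past progressive does NOT entail completion.
        "premise": "Mary was building a house.",
        "hypothesis": "Mary built a house.",
        "gold_label": "neutral",  # completion is not guaranteed
        "event_class": "accomplishment",
    },
]

def hallucinated_completion_rate(predictions):
    """Fraction of accomplishment items where a model predicts 'entailment',
    i.e. treats a goal-oriented event in progress as if it were completed."""
    acc = [(item, pred) for item, pred in zip(ITEMS, predictions)
           if item["event_class"] == "accomplishment"]
    if not acc:
        return 0.0
    wrong = sum(pred == "entailment" for _, pred in acc)
    return wrong / len(acc)

# A model that always predicts 'entailment' shows the bias on every
# accomplishment item.
print(hallucinated_completion_rate(["entailment", "entailment"]))  # -> 1.0
```

A model free of the teleological bias described above would predict "entailment" for the activity item but withhold it for the accomplishment item, since the progressive form leaves the outcome open.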
Related papers
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol, which structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering (RSS) is a probabilistic framework that disentangles physical affordance from semantic execution. RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - Temporal Predictors of Outcome in Reasoning Language Models [0.0]
The chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning. We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact. Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens.
arXiv Detail & Related papers (2025-11-03T08:57:18Z) - Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning [31.08532996770416]
Large language models (LLMs) exhibit cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. We propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences.
arXiv Detail & Related papers (2025-10-09T09:07:31Z) - Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models [4.946483489399819]
Large Language Models (LLMs) are prone to hallucination, the generation of factually incorrect statements. This work investigates the intrinsic, architectural origins of this failure mode through three primary contributions.
arXiv Detail & Related papers (2025-10-07T16:40:31Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [58.32070787537946]
Chain-of-thought (CoT) reasoning enhances the performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - Plausible-Parrots @ MSP2023: Enhancing Semantic Plausibility Modeling using Entity and Event Knowledge [1.6233244703352492]
We enhance the large language model (LLM) with fine-grained entity types, event types and their definitions extracted from an external knowledge base.
The experimental results show the effectiveness of the injected knowledge in modeling the semantic plausibility of events.
arXiv Detail & Related papers (2024-08-29T23:13:45Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z) - Did the Cat Drink the Coffee? Challenging Transformers with Generalized Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performances that are comparable to those achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z)