How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
- URL: http://arxiv.org/abs/2510.06700v1
- Date: Wed, 08 Oct 2025 06:48:08 GMT
- Title: How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
- Authors: Leonardo Bertolazzi, Sandro Pezzelle, Raffaella Bernardi
- Abstract summary: Humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models.
- Score: 6.503236297532475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
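To make the representational claim concrete, here is a minimal sketch of the difference-of-means probing and debiasing described in the abstract, with random arrays standing in for cached model activations; all names, shapes, and labels are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholders for residual-stream activations cached from an LLM on
# labeled syllogisms; in the paper these come from real models, so the
# shapes and labels below are illustrative assumptions only.
n, d_model = 200, 512
acts = rng.standard_normal((n, d_model))                # [n, d_model]
is_valid = rng.integers(0, 2, size=n).astype(bool)      # logical validity labels
is_plausible = rng.integers(0, 2, size=n).astype(bool)  # plausibility labels

def concept_direction(acts, labels):
    """Difference-of-means direction for a binary concept (unit norm)."""
    v = acts[labels].mean(axis=0) - acts[~labels].mean(axis=0)
    return v / np.linalg.norm(v)

v_valid = concept_direction(acts, is_valid)
v_plaus = concept_direction(acts, is_plausible)

# Representational alignment between the two concepts.
print(f"cos(validity, plausibility) = {v_valid @ v_plaus:.3f}")

# A debiasing vector in the abstract's sense can be sketched as the
# validity direction with its plausibility component projected out.
v_debias = v_valid - (v_valid @ v_plaus) * v_plaus
v_debias /= np.linalg.norm(v_debias)
```

On real activations, the cosine printed above is the alignment the paper reports to be strong and predictive of behavioral content effects.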
Related papers
- Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence.
We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs.
Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
- Cognitive Inception: Agentic Reasoning against Visual Deceptions by Injecting Skepticism [81.39177645864757]
We propose Inception, a fully reasoning-based agentic framework that conducts authenticity verification by injecting skepticism.
To the best of our knowledge, this is the first fully reasoning-based framework against AIGC visual deceptions.
arXiv Detail & Related papers (2025-11-21T05:13:30Z)
- The Geometry of Reasoning: Flowing Logics in Representation Space [27.047532187192278]
We study how large language models (LLMs) "think" through their representation space.
We propose a novel geometric framework that models an LLM's reasoning as flows.
arXiv Detail & Related papers (2025-10-10T18:44:00Z)
- LLM Assertiveness can be Mechanistically Decomposed into Emotional and Logical Components [0.17188280334580197]
Large language models (LLMs) often display overconfidence, presenting information with unwarranted certainty in high-stakes contexts.
We use open-source Llama 3.2 models fine-tuned on human-annotated assertiveness datasets.
Our analysis identifies the layers most sensitive to assertiveness contrasts and reveals that high-assertiveness representations decompose into emotional and logical sub-components.
arXiv Detail & Related papers (2025-08-24T01:43:48Z)
- Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers [76.42159902257677]
We argue that both behaviors stem from a single mechanism known as out-of-context reasoning (OCR).
OCR drives both generalization and hallucination, depending on whether the associated concepts are causally related.
Our work provides a theoretical foundation for understanding the OCR phenomenon, offering a new lens for analyzing and mitigating undesirable behaviors from knowledge injection.
arXiv Detail & Related papers (2025-06-12T16:50:45Z)
- How do Transformers Learn Implicit Reasoning? [67.02072851088637]
We study how implicit multi-hop reasoning emerges by training transformers from scratch in a controlled symbolic environment.
We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures.
arXiv Detail & Related papers (2025-05-29T17:02:49Z)
- Mitigating Content Effects on Reasoning in Language Models through Fine-Grained Activation Steering [14.298418197820912]
Large language models (LLMs) frequently demonstrate reasoning limitations, often conflating content plausibility with logical validity.
This can result in biased inferences, where plausible arguments are incorrectly deemed logically valid, or vice versa.
This paper investigates mitigating content biases in formal reasoning through activation steering (a minimal sketch of this kind of intervention appears after this list).
arXiv Detail & Related papers (2025-05-18T01:34:34Z)
- I Predict Therefore I Am: Is Next Token Prediction Enough to Learn Human-Interpretable Concepts from Data? [76.15163242945813]
The capabilities of large language models (LLMs) have led many to conclude that they exhibit a form of intelligence.
We introduce a novel generative model that generates tokens on the basis of human-interpretable concepts represented as latent discrete variables.
arXiv Detail & Related papers (2025-03-12T01:21:17Z)
- Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales [102.54274021830207]
We introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs.
We filter rationales that can be transferred from programming paradigms to end-to-end paradigms to guarantee transferability.
Our approach also reduces hallucinations owing to the high correlation it maintains between images and text.
arXiv Detail & Related papers (2024-04-17T07:20:56Z)
- Contrastive Reasoning in Neural Networks [26.65337569468343]
Inference built on features that identify causal class dependencies is termed feed-forward inference.
In this paper, we formalize the structure of contrastive reasoning and propose a methodology to extract a neural network's notion of contrast.
We demonstrate the value of contrastively recognizing images under distortions, reporting improvements of 3.47%, 2.56%, and 5.48% in average accuracy.
arXiv Detail & Related papers (2021-03-23T05:54:36Z)
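Several entries above (the main paper's steering vectors and the fine-grained activation-steering mitigation work) intervene by adding a concept direction to the residual stream at inference time. A minimal PyTorch sketch of such a hook follows, assuming a GPT-2-style HuggingFace model; the model choice, layer index, steering coefficient, and random placeholder vector are illustrative assumptions rather than any paper's actual configuration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative assumptions only: model, layer, and strength are not
# taken from any of the papers above.
model_name, layer_idx, alpha = "gpt2", 6, 4.0

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Placeholder steering vector; in practice this would be a learned
# validity or plausibility direction (e.g., a difference of class means).
steer = torch.randn(model.config.hidden_size)
steer /= steer.norm()

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # returning a new tuple replaces the block's output.
    hidden = output[0] + alpha * steer.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(add_steering)
with torch.no_grad():
    ids = tok("All flowers are animals. All animals breathe. So all flowers breathe.",
              return_tensors="pt")
    logits = model(**ids).logits  # forward pass runs with steered activations
handle.remove()                   # detach the hook afterwards
```

Swapping the random vector for a plausibility direction would bias validity judgments in the way the main abstract describes; swapping in a debiased (orthogonalized) validity direction is the corresponding mitigation.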