Fugu-MT Paper Translation (Abstract): How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Paper summary: How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

arxiv url: http://arxiv.org/abs/2510.06700v1
Date: Wed, 08 Oct 2025 06:48:08 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.338249
Title: How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
Title (reference translation): How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
Authors: Leonardo Bertolazzi, Sandro Pezzelle, Raffaella Bernardi
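Abstract summary: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. The paper shows that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, it demonstrates that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models.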
Reference score (independently computed attention metric): 6.503236297532475
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
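Abstract (reference translation): Humans and large language models (LLMs) both show content effects: a bias in which the plausibility of a reasoning problem's semantic content influences judgments about its logical validity. In humans this phenomenon is best explained by the dual-process theory of reasoning, but the mechanism behind content effects in LLMs is still unclear. This work addresses the question by examining how LLMs encode validity and plausibility in their internal representations. Both concepts are shown to be linearly represented and strongly aligned in representational geometry, which leads models to conflate plausibility with validity. Using steering vectors, the authors demonstrate that plausibility vectors can causally bias validity judgments, and vice versa, and that the degree of alignment between the two concepts predicts the magnitude of behavioral content effects across models. Finally, they construct debiasing vectors that disentangle the two concepts, reducing content effects and improving reasoning accuracy. The findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.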
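
The abstract refers to linearly represented concepts, alignment between concept directions, and steering vectors. The following is a minimal sketch of those ideas on synthetic data, not the authors' implementation. Everything in it is an assumption made for illustration: the 8-dimensional toy "activations", the difference-of-means construction of concept vectors, cosine similarity as the alignment measure, and orthogonal projection as one possible way to decouple two directions (the abstract does not say how the debiasing vectors are built).

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def concept_vector(pos, neg):
    # Difference of class means: one common way to estimate a linear
    # concept direction (assumed here, not taken from the paper).
    return [p - n for p, n in zip(mean_vector(pos), mean_vector(neg))]

def steer(hidden, direction, alpha):
    # Steering: add a scaled concept direction to a hidden state.
    return [h + alpha * d for h, d in zip(hidden, direction)]

def project_out(v, u):
    # Remove from v its component along u (illustrative decoupling step).
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]

# Toy data (assumption): validity shifts axis 0, while plausibility shifts
# a direction that partly overlaps with axis 0.
DIM = 8
rng = random.Random(0)
VALID_AXIS = [1.0] + [0.0] * (DIM - 1)
PLAUS_AXIS = [0.8, 0.6] + [0.0] * (DIM - 2)

def sample(axis, sign, n=200):
    # Hypothetical activations: class mean plus small Gaussian noise.
    return [[sign * a + rng.gauss(0.0, 0.1) for a in axis] for _ in range(n)]

validity_vec = concept_vector(sample(VALID_AXIS, +1), sample(VALID_AXIS, -1))
plausibility_vec = concept_vector(sample(PLAUS_AXIS, +1), sample(PLAUS_AXIS, -1))

# Alignment between the two concept directions.
alignment = cosine(validity_vec, plausibility_vec)

# Steering a neutral state along plausibility also moves it along validity.
neutral = [0.0] * DIM
steered = steer(neutral, plausibility_vec, alpha=1.0)
validity_shift = dot(steered, validity_vec) - dot(neutral, validity_vec)

# After projecting out the validity direction, the overlap vanishes.
decoupled = project_out(plausibility_vec, validity_vec)
residual_alignment = cosine(validity_vec, decoupled)

print(f"alignment={alignment:.2f}")
print(f"validity_shift={validity_shift:.2f}")
print(f"residual_alignment={residual_alignment:.2e}")
```

In this toy setup the two directions overlap by construction, so steering along the plausibility vector also shifts the state along the validity vector. In a real model the concept vectors would be estimated from hidden states at a chosen layer, and the strength `alpha` would need tuning.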


Related papers