Fugu-MT paper translation (summary): How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

Paper overview: How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals

arxiv url: http://arxiv.org/abs/2604.22271v2
Date: Fri, 01 May 2026 09:11:21 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.816017
Title: How LLMs Detect and Correct Their Own Errors: The Role of Internal Confidence Signals
Authors: Dharshan Kumaran, Viorica Patraucean, Simon Osindero, Petar Veličković, Nathaniel Daw
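Abstract summary: Large language models can detect their own errors and sometimes correct them without external feedback. We investigate this through the lens of second-order models of confidence from decision neuroscience.
Reference score (attention score computed by this site): 6.467495925520036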
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models can detect their own errors and sometimes correct them without external feedback, but the underlying mechanisms remain unknown. We investigate this through the lens of second-order models of confidence from decision neuroscience. In a first-order system, confidence derives from the generation signal itself and is therefore maximal for the chosen response, precluding error detection. Second-order models posit a partially independent evaluative signal that can disagree with the committed response, providing the basis for error detection. Kumaran et al. (2026) showed that LLMs cache a confidence representation at the token immediately following the answer (the post-answer newline, PANL), and that this representation causally drives verbal confidence and dissociates from log-probabilities. Here we test whether this PANL signal extends beyond confidence to support error detection and self-correction, deriving predictions from the second-order framework. Using a verify-then-correct paradigm, we show that: (i) verbal confidence predicts error detection far beyond token log-probabilities, ruling out a first-order account; (ii) PANL activations predict error detection beyond verbal confidence itself; and (iii) PANL predicts which errors the model can correct, where all behavioural signals fail. Causal interventions confirm that PANL signals rescue error detection behaviour when answer information is corrupted. All findings replicate across models (Gemma 3 27B and Qwen 2.5 7B) and tasks (TriviaQA and MNLI). These results reveal that LLMs naturally implement a second-order confidence architecture whose internal evaluative signal encodes not only whether an answer is likely wrong but whether the model has the knowledge to fix it.
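The first-order versus second-order distinction in the abstract can be illustrated with a toy simulation. This is a minimal sketch on synthetic data, not the paper's method: the function name `error_flag_rates`, the Gaussian noise model, and the two-option task are all assumptions made for illustration. A first-order confidence is read from the same signal that produced the choice, so it never opposes the choice; a second-order confidence comes from a separate noisy sample and can disagree with it.

```python
import numpy as np


def error_flag_rates(n=20000, noise_gen=1.0, noise_eval=1.0, seed=0):
    """Toy two-option task (hypothetical setup, not from the paper).

    Returns how often each confidence scheme flags the committed
    response as more likely wrong than right.
    """
    rng = np.random.default_rng(seed)
    truth = rng.choice([-1.0, 1.0], size=n)            # correct option per trial
    gen = truth + noise_gen * rng.standard_normal(n)   # generation signal
    ev = truth + noise_eval * rng.standard_normal(n)   # separate evaluative signal

    choice = np.where(gen >= 0, 1.0, -1.0)             # committed response
    is_error = choice != truth

    # First-order: support for the chosen option read from the generation
    # signal itself. It equals |gen|, so it can never oppose the choice.
    first_order = choice * gen
    # Second-order: support for the chosen option read from the evaluative
    # signal. It goes negative when the evaluator favours the other option.
    second_order = choice * ev

    return {
        "first_order_flagged": float(np.mean(first_order < 0)),
        "second_order_hit_rate": float(np.mean(second_order[is_error] < 0)),
        "second_order_false_alarm": float(np.mean(second_order[~is_error] < 0)),
    }


if __name__ == "__main__":
    for name, value in error_flag_rates().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting the first-order scheme flags no trials at all, while the second-order scheme flags most errors at the cost of some false alarms. The paper's claims concern learned representations inside real models, which this sketch does not attempt to reproduce.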

Related papers