Fugu-MT Paper Translation (Abstract): Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

Paper abstract: Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs

arxiv url: http://arxiv.org/abs/2605.05957v2
Date: Fri, 08 May 2026 12:09:34 GMT
Status: Translation complete
Last updated in system: 2026-05-11 16:31:23.036629
Title: Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
Authors: Zixuan Chen, Hao Lin, Zizhe Chen, Yizhou Tian, Garry Yang, Depeng Wang, Ya Guo, Huijia Zhu, James Cheng
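Abstract summary: LLMs reliably correct false claims when the claims are presented in isolation, yet when the same claims are embedded in task-oriented requests, the models often comply rather than correct. The authors term this failure mode correction suppression, construct a benchmark of 300 false premises, and evaluate it systematically across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon.
Reference score (independently computed attention score): 26.062372963777452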
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLMs reliably correct false claims when presented in isolation, yet when the same claims are embedded in task-oriented requests, they often comply rather than correct. We term this failure mode correction suppression and construct a benchmark of 300 false premises to systematically evaluate it across eight models. Suppression rates range from 19% to 90%, with four models exceeding 80%, establishing correction suppression as a prevalent and severe phenomenon. Mechanistic analysis reveals that suppression is not a knowledge failure: the model registers the error internally, but task context diverts early-layer attention from the false claim as output intent crystallizes toward compliance at middle layers. We characterize this as knowing but not correcting: suppression occurs at response selection rather than knowledge encoding. Guided by this mechanism, we propose two training-free interventions. Correction Direction Steering (CDS) estimates a correction-compliance direction from matched pairs and injects it at middle layers before output intent crystallizes. Dynamic Payload Amplification (DPA) localizes payload tokens via attention divergence between early and late layers and amplifies their representation at the final layer, requiring no calibration data. Experiments on Qwen3.5-9B and LLaMA3.1-8B show both methods substantially improve factual strictness. CDS achieves the highest correction rate on Qwen3.5-9B (0% to 58.2%). DPA is the only method that preserves or improves reasoning capability on both models. These findings introduce factual strictness, the willingness to uphold accuracy against contextual pressures, as a new dimension of model reliability.
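The abstract describes the two interventions only at a high level, so the following is a minimal, illustrative sketch of the general ideas and not the authors' implementation. It assumes that a CDS-style direction can be approximated by a normalized difference of mean hidden states between "corrected" and "complied" examples, and that a DPA-style payload score can be approximated by the absolute gap between an early-layer and a late-layer attention weight per token. All function names, the scaling parameters `alpha` and `gamma`, and the toy vectors are hypothetical.

```python
from math import sqrt
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def correction_direction(corrected: Sequence[Sequence[float]],
                         complied: Sequence[Sequence[float]]) -> Vector:
    """Unit vector pointing from the mean 'complied' state to the mean
    'corrected' state (difference-of-means; an assumption, not the paper's
    stated estimator)."""
    diff = [a - b for a, b in zip(mean_vector(corrected), mean_vector(complied))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("both groups have the same mean; no direction to estimate")
    return [d / norm for d in diff]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Add the scaled direction to one hidden state (hypothetical strength alpha)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def payload_scores(early_attn: Sequence[float], late_attn: Sequence[float]) -> Vector:
    """Per-token divergence between early- and late-layer attention weights.
    Using the absolute gap is an assumption made for this sketch."""
    return [abs(e - l) for e, l in zip(early_attn, late_attn)]


def amplify(token_states: Sequence[Sequence[float]],
            scores: Sequence[float], gamma: float) -> List[Vector]:
    """Scale each token state by 1 + gamma * (score / max score)."""
    top = max(scores)
    if top == 0.0:
        return [list(state) for state in token_states]
    return [[x * (1.0 + gamma * s / top) for x in state]
            for state, s in zip(token_states, scores)]


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states" for matched pairs.
    direction = correction_direction([[1.0, 0.0], [3.0, 0.0]],
                                     [[0.0, 0.0], [0.0, 0.0]])
    print(steer([0.5, 0.5], direction, alpha=2.0))

    # Toy attention weights over three tokens.
    scores = payload_scores([0.25, 0.5, 0.25], [0.25, 0.25, 0.5])
    print(amplify([[1.0], [1.0], [1.0]], scores, gamma=1.0))
```

In a real model these functions would operate on tensors captured from specific layers; which layers count as "middle", "early", and "late", and how strongly to steer or amplify, are not specified in the abstract and would need to be taken from the paper itself.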
