Fugu-MT Paper Translation (Abstract): Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Paper summary: Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

arxiv url: http://arxiv.org/abs/2603.20172v2
Date: Mon, 23 Mar 2026 21:10:16 GMT
Status: Translation complete
Last updated in system: 2026-03-25 12:42:17.584792
Title: Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Authors: Richard J. Young
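Abstract summary: Recent work on chain-of-thought faithfulness reports single aggregate numbers. This paper provides evidence that faithfulness is not an objective, measurable property of a model.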
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper provides evidence that it is not. Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce faithfulness rates of 74.4%, 82.6%, and 69.7%. Per-model gaps range from 2.6 to 30.6 percentage points; all pairwise McNemar tests are significant (p < 0.001). The disagreements are systematic: Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under Sonnet; OLMo-3.1-32B moves from 9th to 3rd. Different classifiers operationalize faithfulness at different levels of stringency (lexical mention versus epistemic dependence), yielding divergent measurements on the same behavior. These results indicate that published faithfulness numbers cannot be meaningfully compared across studies using different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies.
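The agreement statistics named in the abstract (Cohen's kappa and the McNemar test) can be illustrated with a minimal sketch. It assumes each classifier emits one binary label per trace (1 = faithful, 0 = unfaithful). The function names and the toy labels are hypothetical, and the exact binomial form of McNemar's test is used here for simplicity; the abstract does not say which variant the paper applies.

```python
from math import comb


def contingency(a, b):
    """Count paired binary labels; returns (n11, n10, n01, n00)."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    return n11, n10, n01, n00


def cohen_kappa(a, b):
    """Chance-corrected agreement between two binary classifiers."""
    n11, n10, n01, n00 = contingency(a, b)
    n = n11 + n10 + n01 + n00
    p_obs = (n11 + n00) / n                      # observed agreement
    p_a = (n11 + n10) / n                        # rate of label 1 under a
    p_b = (n11 + n01) / n                        # rate of label 1 under b
    p_exp = p_a * p_b + (1 - p_a) * (1 - p_b)    # agreement expected by chance
    if p_exp == 1.0:
        return 1.0                               # both classifiers are constant
    return (p_obs - p_exp) / (1 - p_exp)


def mcnemar_exact_p(a, b):
    """Two-sided exact McNemar p-value, computed from discordant pairs only."""
    _, n10, n01, _ = contingency(a, b)
    n = n10 + n01
    if n == 0:
        return 1.0                               # no disagreements to test
    k = min(n10, n01)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels (hypothetical, not data from the paper).
pipeline = [1, 1, 1, 1, 1, 1, 0, 0]
judge = [1, 1, 0, 0, 0, 0, 0, 0]
print(contingency(pipeline, judge))
print(round(cohen_kappa(pipeline, judge), 3))
print(mcnemar_exact_p(pipeline, judge))
```

The discordant counts `n10` and `n01` are the quantities behind the asymmetry the abstract reports for sycophancy hints (883 versus 2): McNemar's test asks whether those two counts differ by more than chance, while kappa summarizes overall agreement after correcting for chance.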

Related papers