Fugu-MT Paper Translation (Abstract): Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Paper abstract: Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

arxiv url: http://arxiv.org/abs/2603.20172v1
Date: Fri, 20 Mar 2026 17:48:43 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:39.266154
Title: Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Title (reference translation): Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Authors: Richard J. Young
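Abstract summary: This paper shows that faithfulness is not an objective, measurable property of a model. Three classifiers are applied to 10,276 influenced reasoning traces. The overall faithfulness rates are 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals.
Reference score (independently computed attention score): 0.0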
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points; all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic, not random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior. These results demonstrate that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.
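Abstract (reference translation): Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (for example, that DeepSeek-R1 acknowledges hints 39% of the time), which suggests that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. On identical data, these classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, with non-overlapping 95% confidence intervals. Per-model gaps range from 2.6 to 30.6 percentage points, and all are statistically significant (McNemar's test, p < 0.001). The disagreements are systematic rather than random: inter-classifier agreement measured by Cohen's kappa ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints, and the asymmetry is pronounced. For sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction. Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge, while OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd. The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements of the same behavior. These results show that published faithfulness numbers cannot be meaningfully compared across studies that use different classifiers, and that future evaluations should report sensitivity ranges across multiple classification methodologies rather than single point estimates.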
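The statistics named in the abstract (95% confidence intervals, McNemar's test, Cohen's kappa) can be illustrated with a minimal Python sketch. This is not the author's analysis code. It assumes Wilson score intervals (the abstract does not say which interval method was used), approximates success counts from the rounded percentages, and uses hypothetical agreement counts for the kappa example, since the abstract reports only the discordant sycophancy counts (883 and 2).

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

def mcnemar_chi2(b: int, c: int) -> float:
    """McNemar statistic (no continuity correction) from the two discordant counts."""
    return (b - c) ** 2 / (b + c)

def cohens_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for a 2x2 table of paired binary labels.

    a: both classifiers say faithful     b: only the first says faithful
    c: only the second says faithful     d: both say unfaithful
    """
    n = a + b + c + d
    p_observed = (a + d) / n
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

N_TRACES = 10_276  # from the abstract

# Success counts are ASSUMED: reconstructed from the rounded rates in the abstract.
rates = {"regex": 0.744, "pipeline": 0.826, "sonnet_judge": 0.697}
intervals = {
    name: wilson_interval(round(rate * N_TRACES), N_TRACES)
    for name, rate in rates.items()
}

# Discordant sycophancy counts from the abstract: pipeline-faithful/judge-unfaithful
# versus the reverse.
chi2_sycophancy = mcnemar_chi2(883, 2)

# HYPOTHETICAL concordant counts (600 and 100) chosen only to make the table complete;
# the abstract does not report them.
kappa_example = cohens_kappa(a=600, b=883, c=2, d=100)

for name, (lo, hi) in intervals.items():
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
print(f"McNemar chi-square (sycophancy): {chi2_sycophancy:.2f}")
print(f"Cohen's kappa (hypothetical table): {kappa_example:.3f}")
```

With one degree of freedom, a McNemar statistic above roughly 10.83 corresponds to p < 0.001, so a lopsided split such as 883 versus 2 clears that threshold by a wide margin. The kappa value printed here depends entirely on the hypothetical concordant counts and should not be read as a reproduction of the paper's 0.06.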

Related papers