Fugu-MT paper translation (abstract): Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

Paper abstract: Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

arxiv url: http://arxiv.org/abs/2605.25052v1
Date: Sun, 24 May 2026 12:57:01 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:18.733712
Title: Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth
Authors: Yoav Gur-Arieh, Ana Marasović, Mor Geva
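Abstract summary: Chains of thought (CoTs) have become central to interpreting and auditing the behavior of large language models. The authors develop an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Their experiments show that most metrics perform near chance, exhibit strong prediction biases, and degrade on longer CoTs.
Reference score (attention score computed by this site): 24.21103008618097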
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chains of thought (CoTs) have become central in interpreting and auditing behaviors of large language models. Yet growing evidence suggests that these traces often fail to faithfully represent the computations behind a model's predictions. Several faithfulness metrics have been proposed, but whether they indeed measure faithfulness remains unknown. Answering this requires ground-truth labels, which are hard to obtain since internal computations are not directly observable. Consequently, most works proposing metrics report only absolute scores or comparisons to prior metrics, and the few existing benchmarks rely on proxies like plausibility or importance, properties orthogonal to faithfulness that can mislead about whether a CoT can be trusted. We address this challenge by constructing tasks whose outputs reveal which intermediate computations must have produced them, and developing an automated labeling pipeline that yields ground-truth faithfulness labels at both the step and CoT level. Building on this methodology, we present BonaFide, a benchmark of 3,066 labeled CoTs across 13 tasks and 10 models, and use it to conduct the first systematic evaluation of prominent faithfulness metrics. Our experiments show that most metrics perform near chance, exhibit strong prediction biases and degrade on longer CoTs. The best metric reaches only 0.70 AUROC at the CoT level while another reaches 0.59 at the step level, with neither transferring across settings, while entailing prohibitively high computational cost. Our results expose fundamental gaps in current faithfulness evaluation and call for the development of more reliable and efficient metrics.
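The abstract reports metric quality as AUROC against ground-truth faithfulness labels. As a rough illustration of what that number means, the sketch below computes AUROC for one metric from binary labels (1 = faithful, 0 = unfaithful) and per-CoT scores, using the rank-based (Mann-Whitney) definition. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from BonaFide or from the authors' code; the paper's actual evaluation pipeline may differ.

```python
def auroc(labels, scores):
    """AUROC via the Mann-Whitney statistic.

    Returns the fraction of (faithful, unfaithful) pairs in which the
    faithful CoT receives the higher metric score; ties count as 0.5.
    A value of 0.5 corresponds to chance-level ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: ground-truth labels and one metric's scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.6, 0.5, 0.3, 0.7]
print(f"AUROC = {auroc(labels, scores):.2f}")
```

Read this way, the reported 0.70 at the CoT level means the best metric ranks a faithful CoT above an unfaithful one in about 70% of such pairs, and 0.59 at the step level is only modestly above the 0.5 expected from random scoring.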

Related papers