Fugu-MT Paper Translation (Abstract): Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

Paper abstract: Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

arxiv url: http://arxiv.org/abs/2601.05114v1
Date: Thu, 08 Jan 2026 17:02:22 GMT
Status: Translation complete
Last updated in system: 2026-01-09 17:01:53.294611
Title: Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior
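Title (reference translation): Evaluative Fingerprints: Stability and Systematic Differences of LLM Evaluators
Authors: Wajid Nasser
Abstract summary: Judges are consistent with themselves, but they do not agree with each other. Across 3,240 evaluations, inter-judge agreement is near zero. Averaging the judges' scores produces a synthetic verdict that corresponds to no judge's actual values.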
Reference score (independently computed attention metric): 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges x 120 unique video x pack items x 3 independent runs), inter-judge agreement is near-zero (Krippendorff's α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn't chaos; it's structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each judge implements a distinct, stable theory of quality: an "evaluative disposition" that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). The implication is stark: LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge's actual values.
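Abstract (reference translation): LLM-as-judge systems promise scalable, consistent evaluation. Judges are consistent, but not with each other. Across 3,240 evaluations (9 judges x 120 unique video x pack items x 3 independent runs), inter-judge agreement is near zero (Krippendorff's α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement is not chaos. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable that they function as fingerprints. Each judge implements a distinct, stable theory of quality: an "evaluative disposition" that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge's actual values.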
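The abstract reports inter-judge agreement as Krippendorff's α, including negative values on two dimensions. As a minimal sketch of how α can fall below zero when judges are each stable but systematically different, the following computes α for complete data under the interval metric. The function name `krippendorff_alpha_interval` and the toy scores are hypothetical illustrations. They are not the paper's code or data, and the abstract does not state which measurement level the authors used.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` is a list of lists: each inner list holds the scores that
    different judges gave to the same item. Items with fewer than two
    scores are not pairable and are skipped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable scores")

    # Observed disagreement: squared differences within each item,
    # each item weighted by 1 / (number of scores - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n

    # Expected disagreement: squared differences over all pairs of scores,
    # ignoring which item they belong to.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("scores have no variation; alpha is undefined")

    return 1.0 - d_o / d_e


# Toy case (assumed data): judge A always answers 1, judge B always answers 5.
# Each judge is perfectly stable, yet they never agree on any item.
split_scores = [[1, 5], [1, 5], [1, 5]]
alpha_split = krippendorff_alpha_interval(split_scores)  # about -0.667

# The per-item mean is 3.0, a score that neither judge ever gave.
mean_verdict = sum(split_scores[0]) / len(split_scores[0])
```

In this toy case the disagreement within each item is larger than the disagreement expected from pooling all scores, so α is negative. The averaged score of 3.0 matches neither judge, which mirrors the abstract's point about synthetic verdicts.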


Related papers