Fugu-MT Paper Translation (Abstract): Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement

Paper overview: Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement

arxiv url: http://arxiv.org/abs/2510.09738v1
Date: Fri, 10 Oct 2025 17:27:33 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.603379
Title: Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
Title (reference translation): The Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement
Authors: Steve Han, Gilberto Titericz Junior, Tom Balough, Wenfei Zhou,
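Abstract summary: This study proposes a new two-step methodology for evaluating Large Language Models (LLMs) as judges for response accuracy evaluation tasks. It assesses how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers.
Reference score (independently calculated attention score): 1.5191981795942073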
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This research introduces the Judge's Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen's Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a "Turing Test for judges" based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.
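Abstract (reference translation): This study introduces the Judge's Verdict Benchmark, a new two-step methodology for evaluating Large Language Models (LLMs) as judges for response accuracy evaluation tasks. It assesses how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. The methodology progresses from traditional correlation analysis to a Cohen's Kappa analysis that measures actual agreement patterns. The two-step approach consists of (1) a correlation test that filters for judges with strong alignment, followed by (2) a human-likeness test using z-scores that identifies two distinct judgment patterns: human-like judgment (|z| < 1), which mimics natural human variation, and super-consistent judgment (z > 1), which exceeds typical human-to-human agreement levels. The methodology shows that 27 of the 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models show super-consistent behavior, which could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants) shows that judge quality depends not only on model size but on specific training strategies. The key contributions are: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a "Turing Test for judges" based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.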
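
The two-step procedure in the abstract (a correlation filter, then a z-score test on Cohen's Kappa) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the correlation threshold, the exact z-score definition, or a name for judges with z < -1. The function names, the `r_min=0.8` cutoff, the use of the per-item human mean for the correlation step, the z-score computed against the mean and sample standard deviation of pairwise human-to-human kappas, and the "below human agreement" label are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cohen_kappa(a, b):
    """Cohen's Kappa for two raters assigning categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def classify_judge(judge, humans, r_min=0.8):
    """Return (label, z). r_min is a hypothetical threshold, not from the paper.

    Needs at least three human annotators whose pairwise kappas are not
    all identical, otherwise the standard deviation is undefined or zero.
    """
    # Step 1 (assumed form): correlate the judge with the per-item human mean.
    human_mean = [mean(scores) for scores in zip(*humans)]
    if pearson(judge, human_mean) < r_min:
        return "filtered out", None

    # Step 2 (assumed form): compare judge-to-human kappa with the
    # distribution of human-to-human kappas.
    human_kappas = [cohen_kappa(h1, h2) for h1, h2 in combinations(humans, 2)]
    judge_kappa = mean(cohen_kappa(judge, h) for h in humans)
    z = (judge_kappa - mean(human_kappas)) / stdev(human_kappas)

    if z > 1:
        return "super-consistent", z
    if z < -1:
        # The abstract does not name this case; the label is an assumption.
        return "below human agreement", z
    return "human-like", z  # treating |z| == 1 as human-like is an assumption


# Toy data: three hypothetical annotators scoring ten responses as 0 or 1.
humans = [
    [0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 0, 1, 0, 0, 0],
]
consensus_judge = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
noisy_judge = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

print(classify_judge(consensus_judge, humans))
print(classify_judge(noisy_judge, humans))
print(classify_judge(noisy_judge, humans, r_min=0.6))
```

In this toy data the consensus judge agrees with every annotator more closely than the annotators agree with one another, which is the situation the abstract calls super-consistent. The noisy judge shows why the order of the two steps matters: it is rejected by the correlation filter at the assumed default threshold, and only reaches the z-score test when that threshold is lowered.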

Related papers