Fugu-MT Paper Translation (Summary): Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems

Paper summary: Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems

arxiv url: http://arxiv.org/abs/2512.01786v1
Date: Mon, 01 Dec 2025 15:26:20 GMT
Status: Translation complete
System update date: 2025-12-02 19:46:34.922872
Title: Who Judges the Judge? LLM Jury-on-Demand: Building Trustworthy LLM Evaluation Systems
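Authors: Xiaochuan Li, Ke Wang, Girija Gouda, Shubham Choudhary, Yaqun Wang, Linwei Hu, Joel Vaughan, Freddy Lecue
Abstract summary: We propose a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines.
Reference score (independently computed attention score): 2.9141470183751674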
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) become integrated into high-stakes domains, there is a growing need for evaluation methods that are both scalable for real-time deployment and reliable for critical decision-making. While human evaluation is reliable, it is slow and costly. Single LLM judges are biased, and static juries lack adaptability. To overcome these limitations, we propose LLM Jury-on-Demand - a dynamic, learning-based framework for scalable and context-aware evaluation. Our method trains a set of reliability predictors to assess when LLM judges will agree with human experts, leveraging token distributions, embeddings, and structural input features. This enables a fully adaptive evaluation where, for each data point, an optimal jury of the most reliable judges is dynamically selected, and their scores are aggregated using their reliability as weights. Experiments on summarization and RAG benchmarks show that our dynamic jury system achieves significantly higher correlation with human judgment than both single-judge and static-jury baselines. These results highlight the promise of adaptive, learning-based juries for building scalable, more reliable and trustworthy evaluation systems for modern LLMs in high-stakes domains.
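
The per-data-point step described in the abstract (select the most reliable judges, then aggregate their scores using reliability as weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed jury size `k`, and the weighted-mean formula are assumptions made for illustration, and the reliability values are taken as given rather than produced by trained predictors.

```python
from typing import Dict, List, Tuple


def select_jury(reliability: Dict[str, float], k: int) -> List[str]:
    """Return the k judges with the highest predicted reliability (assumed rule)."""
    ranked = sorted(reliability, key=lambda name: reliability[name], reverse=True)
    return ranked[:k]


def jury_score(
    scores: Dict[str, float],
    reliability: Dict[str, float],
    k: int = 2,
) -> Tuple[float, List[str]]:
    """Aggregate the selected judges' scores as a reliability-weighted mean."""
    jury = select_jury(reliability, k)
    total_weight = sum(reliability[name] for name in jury)
    if total_weight <= 0:
        raise ValueError("selected judges must have positive total reliability")
    weighted_sum = sum(reliability[name] * scores[name] for name in jury)
    return weighted_sum / total_weight, jury


# Hypothetical judge scores and predicted reliabilities for one data point.
scores = {"judge_a": 4.0, "judge_b": 2.0, "judge_c": 5.0}
reliability = {"judge_a": 0.75, "judge_b": 0.25, "judge_c": 0.5}

final, jury = jury_score(scores, reliability, k=2)
print(jury, final)
```

With these made-up numbers, `judge_b` is left out of the jury because it has the lowest predicted reliability, and the final score is the weighted mean of the two remaining judges' scores.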

