Fugu-MT Paper Translation (Summary): Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Paper summary: Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

arxiv url: http://arxiv.org/abs/2603.22214v1
Date: Mon, 23 Mar 2026 17:12:29 GMT
Status: translation complete
Last updated in system: 2026-03-24 19:11:39.806373
Title: Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models
Authors: Tom Biskupski, Stephan Kleber
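Abstract summary: A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors.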
Reference score (independently computed attention score): 0.20052993723676893
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: A Large Language Model (LLM) as judge evaluates the quality of victim Machine Learning (ML) models, specifically LLMs, by analyzing their outputs. An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis. The resulting automation scales up the complex evaluation of the victim models' free-form text outputs, because its judgments are faster and more consistent than those of human reviewers. Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases. Being a comparatively new technique, LLMs as judges lack a thorough investigation of their reliability and agreement with human judgment. Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs. We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors. As the assessment objective, we curate datasets for eight different categories of judgment tasks and the corresponding ground-truth labels based on human assessments. Our empirical results show a high correlation of LLMs as judges with human assessments when combined with a suitable prompt, in particular for GPT-4o, several open-source models with $\geqslant$ 32B parameters, and a few smaller models like Qwen2.5 14B.
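The abstract does not say which statistic is used to compare judge verdicts with the human ground-truth labels. As a minimal sketch of how such a comparison could be scored, the Python below computes raw agreement and Cohen's kappa for one judge configuration (one model plus one judge prompt). The metric choice, the `pass`/`fail` label set, and the sample data are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def accuracy(judge, human):
    """Fraction of items where the judge verdict equals the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge, human):
    """Chance-corrected agreement between judge verdicts and human labels."""
    n = len(judge)
    p_observed = accuracy(judge, human)
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    # Agreement expected if both raters labelled independently at their own rates.
    labels = set(judge) | set(human)
    p_expected = sum(judge_counts[c] * human_counts[c] for c in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: human ground truth and the verdicts of one judge setup.
human_labels = ["pass", "fail", "pass", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "fail", "pass", "fail", "fail", "fail", "pass", "pass"]

print("accuracy:", accuracy(judge_labels, human_labels))
print("kappa:", cohen_kappa(judge_labels, human_labels))
```

Kappa is shown next to raw agreement because a judge that always returns the majority label can score high accuracy on an imbalanced dataset while carrying no real information about the human assessment.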
