Fugu-MT Paper Translation (Abstract): RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

Paper summary: RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following

arxiv url: http://arxiv.org/abs/2603.25133v1
Date: Thu, 26 Mar 2026 07:55:32 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.168156
Title: RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
Title (reference translation): RubricEval: A Rubric-Level Meta-Evaluation Benchmark for Instruction Following by LLM Judges
Authors: Tianjun Pan, Xuan Lin, Wenyan Yang, Qianyu He, Shisong Chen, Licai Qi, Wanqing Xu, Hongwei Feng, Bo Xu, Yanghua Xiao
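Abstract summary: Rubric-based evaluation has become the prevailing approach to evaluating instruction following in large language models (LLMs). The benchmark features (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, with Easy/Hard subsets that better differentiate judge performance. In terms of evaluation paradigm, rubric-level evaluation outperforms checklist-level evaluation, explicit reasoning improves accuracy, and the two together reduce inter-judge variance.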
Reference score (independently computed attention score): 46.45323577110897
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Rubric-based evaluation has become a prevailing paradigm for evaluating instruction following in large language models (LLMs). Despite its widespread use, the reliability of these rubric-level evaluations remains unclear, calling for meta-evaluation. However, prior meta-evaluation efforts largely focus on the response level, failing to assess the fine-grained judgment accuracy that rubric-based evaluation relies on. To bridge this gap, we introduce RubricEval. Our benchmark features: (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) a substantial set of 3,486 quality-controlled instances, along with Easy/Hard subsets that better differentiate judge performance. Our experiments reveal that rubric-level judging remains far from solved: even GPT-4o, a widely adopted judge in instruction-following benchmarks, achieves only 55.97% on the Hard subset. Regarding the evaluation paradigm, rubric-level evaluation outperforms checklist-level evaluation, explicit reasoning improves accuracy, and both together reduce inter-judge variance. Through our established rubric taxonomy, we further identify common failure modes and offer actionable insights for reliable instruction-following evaluation.
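Abstract (reference translation): Rubric-based evaluation has become the prevailing paradigm for evaluating instruction following in large language models (LLMs). Although widely used, the reliability of these rubric-level evaluations remains unclear, which calls for meta-evaluation. Prior meta-evaluation efforts, however, have focused mainly on the response level and could not assess the fine-grained judgment accuracy that rubric-based evaluation depends on. To fill this gap, we introduce RubricEval. The proposed benchmark features (1) the first rubric-level meta-evaluation benchmark for instruction following, (2) diverse instructions and responses spanning multiple categories and model sources, and (3) 3,486 quality-controlled instances, together with Easy/Hard subsets that better differentiate judge performance. Even GPT-4o, widely adopted as a judge in instruction-following benchmarks, reaches only 55.97% on the Hard subset. In terms of evaluation paradigm, rubric-level evaluation outperforms checklist-level evaluation, explicit reasoning improves accuracy, and the two together reduce inter-judge variance. Through the established rubric taxonomy, we further identify common failure modes and provide practical insights for reliable instruction-following evaluation.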
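As a rough illustration of what rubric-level meta-evaluation measures, the sketch below scores a judge by how often its per-rubric verdicts agree with human labels, split by subset. The record layout, field names, and sample data are hypothetical and are not taken from RubricEval; the paper's actual metric and data format may differ.

```python
from collections import defaultdict

# Hypothetical records: one per (instruction, response, rubric) instance.
# "human" is the gold verdict; "judge" is the LLM judge's verdict.
records = [
    {"subset": "Easy", "human": True, "judge": True},
    {"subset": "Easy", "human": False, "judge": False},
    {"subset": "Hard", "human": True, "judge": False},
    {"subset": "Hard", "human": False, "judge": False},
]


def rubric_level_accuracy(records):
    """Return, per subset, the fraction of judge verdicts matching the human label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subset"]] += 1
        correct[r["subset"]] += int(r["judge"] == r["human"])
    return {subset: correct[subset] / total[subset] for subset in total}


accuracy = rubric_level_accuracy(records)
print(accuracy)
```

Scoring each rubric verdict separately, rather than one overall verdict per response, is what distinguishes this from response-level meta-evaluation.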

