Fugu-MT Paper Translation (Abstract): Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

Paper overview: Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

arxiv url: http://arxiv.org/abs/2603.00077v1
Date: Fri, 13 Feb 2026 02:26:30 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.030077
Title: Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Title (reference translation): Autorubric: A Unified Framework for Rubric-Based LLM Evaluation
Authors: Delip Rao, Chris Callison-Burch
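Abstract summary: We present a unified framework for rubric-based evaluation with large language models (LLMs). Each technique is paired with its realization in Autorubric, the open-source Python framework proposed in this paper. Autorubric supports weighted binary, ordinal, and nominal criteria, as well as single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation.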
Reference score (independently computed attention score): 34.429649156970015
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $κ$, weighted $κ$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
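Abstract (reference translation): Rubric-based evaluation with large language models (LLMs) has become a standard way to assess text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. In Autorubric, the open-source Python framework proposed in this paper, each identified technique is paired with its realization. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $κ$, weighted $κ$, correlation coefficients, and distribution-level tests) alongside production infrastructure, including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, and show that it produces results consistent with published benchmarks while exercising the framework's key capabilities. We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.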
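
The aggregation strategies and the agreement statistic named in the abstract can be illustrated with a minimal Python sketch. This is not Autorubric's API: the function names (`aggregate_votes`, `rubric_score`, `cohens_kappa`), the strategy strings, and the tie-breaking rule are all assumptions made for illustration, and the abstract does not say how the framework itself handles ties or normalizes weights.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def aggregate_votes(votes: Sequence[bool], strategy: str = "majority",
                    weights: Optional[Sequence[float]] = None) -> bool:
    """Combine binary verdicts from several judges into one verdict.

    Hypothetical helper; the strategy names mirror the abstract's list.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    if strategy == "majority":
        # Assumption: a strict majority is required, so ties count as "not met".
        return sum(votes) * 2 > len(votes)
    if strategy == "weighted":
        if weights is None or len(weights) != len(votes):
            raise ValueError("weighted aggregation needs one weight per vote")
        met = sum(w for v, w in zip(votes, weights) if v)
        return met * 2 > sum(weights)
    if strategy == "unanimous":
        return all(votes)
    if strategy == "any":
        return any(votes)
    raise ValueError(f"unknown strategy: {strategy}")


def rubric_score(verdicts: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Weighted fraction of binary criteria that were met, in [0, 1].

    Assumes all weights are positive.
    """
    total = sum(weights.values())
    earned = sum(weights[name] for name, met in verdicts.items() if met)
    return earned / total


def cohens_kappa(a: Sequence, b: Sequence) -> float:
    """Unweighted Cohen's kappa between two raters' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)  # chance agreement
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


votes = [True, True, False]
print(aggregate_votes(votes, "majority"))             # True
print(aggregate_votes(votes, "unanimous"))            # False
print(aggregate_votes(votes, "weighted", [1, 1, 3]))  # False
print(rubric_score({"cites_sources": True, "is_concise": False},
                   {"cites_sources": 3.0, "is_concise": 1.0}))  # 0.75
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))       # 0.5
```

In the weighted example the dissenting judge carries more weight than the other two combined, which is why the verdict flips relative to simple majority voting.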

Related papers