Fugu-MT Paper Translation (Abstract): Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

Paper abstract: Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement

arxiv url: http://arxiv.org/abs/2605.06283v1
Date: Thu, 07 May 2026 13:55:11 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:11.865228
Title: Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement
Title (reference translation): Quantifying the Statistical Effect of Rubric Modifications
Authors: Jessica Huynh, Alfredo Gomez, Athiya Deviyani, Renee Shelby, Jeffrey P. Bigham, Fernando Diaz
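Abstract summary: Autoraters are increasingly used for evaluation and automated content moderation. There is limited statistical analysis of how modifications to a rubric presented to both humans and autoraters affect score agreement.
Reference score (independently computed attention score): 50.34437999083224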
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoraters, also referred to as LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there is limited statistical analysis of how modifications in a rubric presented to both humans and autoraters affect their score agreement. Rubrics that ask for an overall or holistic judgment - for example, rating the "quality" of an essay - may be inconsistently interpreted due to the complexity or subjectivity of the criteria. Conversely, rubrics can ask for analytic judgments, which decompose assessment criteria - for example, "quality" into "fluency" and "organization". While these rubrics can be edited to improve the individual accuracy of both human and automated scoring, this approach may result in disagreement between the two scores, or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not just the relationship between human and autorater annotations but how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits providing representative examples and additional context, and reducing positional bias in the rubric increased human-autorater agreement, while higher rubric complexity and conservative aggregation methods tended to decrease it. The findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance to move towards higher human-autorater agreement.
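Abstract (reference translation): Autoraters, also called LLM-as-judges, are increasingly used for evaluation and automated content moderation. However, there has been only limited statistical analysis of how modifications to a rubric presented to both humans and autoraters affect score agreement. Rubrics that ask for an overall, or holistic, judgment, such as rating the "quality" of an essay, may be interpreted inconsistently because of the complexity or subjectivity of the criteria. Conversely, rubrics can ask for analytic judgments that decompose the assessment criteria, for example splitting "quality" into "fluency" and "organization". These rubrics can be edited to improve the individual accuracy of both human and automated scoring, but doing so may produce disagreement between the two scores or with the associated holistic judgment. Designing and deploying reliable autoraters requires understanding not only the relationship between human and autorater annotations but also how that relationship changes as holistic or analytic judgments are elicited. The results indicate that rubric edits that provide representative examples and additional context, and that reduce positional bias in the rubric, increased human-autorater agreement, whereas higher rubric complexity and conservative aggregation methods tended to decrease it. Findings from the automatic essay scoring and instruction-following evaluation domains suggest that practitioners should carefully analyze domain- and rubric-specific performance in order to move toward higher human-autorater agreement.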

Related papers