Fugu-MT Paper Translation (Abstract): Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

Paper overview: Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness

arxiv url: http://arxiv.org/abs/2601.08843v1
Date: Sun, 21 Dec 2025 05:22:04 GMT
Status: Translation complete
Last updated in system: 2026-01-25 16:54:51.666765
Title: Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness
Title (reference translation): Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness
Authors: Haotian Deng, Chris Farber, Jiyoon Lee, David Tang
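Abstract summary: We systematically evaluate the performance of large language models as judges for rubric-based short-answer grading. Alignment with expert judgment is strong for binary tasks but degrades as rubric granularity increases. Experiments show that the model is resilient to prompt injection but sensitive to synonym substitutions.
Reference score (independently computed attention score): 4.129847064263056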
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated short-answer grading (ASAG) remains a challenging task due to the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. While Large Language Models (LLMs) offer a promising solution, their reliability as automated judges in rubric-based settings requires rigorous assessment. In this paper, we systematically evaluate the performance of LLM-judges for rubric-based short-answer grading. We investigate three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy facilitated by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades with increased rubric granularity. Our "Trust Curve" analysis demonstrates a clear trade-off where filtering low-confidence predictions improves accuracy on the remaining subset. Additionally, robustness experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions. Our work provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges, highlighting the importance of uncertainty estimation and robustness testing for reliable deployment.
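Abstract (reference translation): Automated short-answer grading (ASAG) remains a challenging task because of the linguistic variability of student responses and the need for nuanced, rubric-aligned partial credit. Large Language Models (LLMs) offer a promising solution, but their reliability as automated judges in rubric-based settings requires rigorous evaluation. In this paper, we systematically evaluate the performance of LLM judges for rubric-based short-answer grading. We examine three key aspects: the alignment of LLM grading with expert judgment across varying rubric complexities, the trade-off between uncertainty and accuracy enabled by a consensus-based deferral mechanism, and the model's robustness under random input perturbations and adversarial attacks. Using the SciEntsBank benchmark and Qwen 2.5-72B, we find that alignment is strong for binary tasks but degrades as rubric granularity increases. Our "Trust Curve" analysis shows a clear trade-off in which filtering out low-confidence predictions improves accuracy on the remaining subset. In addition, robustness experiments show that the model is resilient to prompt injection but sensitive to synonym substitutions. Our work highlights the importance of uncertainty estimation and robustness testing for reliable deployment, and provides critical insights into the capabilities and limitations of rubric-conditioned LLM judges.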
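The abstract describes a consensus-based deferral mechanism and a "Trust Curve" but does not give implementation details. The following is a minimal sketch of one way such an analysis could be set up, not the authors' method. It assumes that each answer is graded several times, that confidence is the fraction of sampled grades agreeing with the majority grade, and that answers below a threshold are deferred to a human. The function names `consensus` and `trust_curve` and the toy data are hypothetical.

```python
from collections import Counter


def consensus(votes):
    """Return (majority_label, agreement) for one answer's sampled grades.

    Assumption: agreement = share of samples equal to the majority label.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)


def trust_curve(samples, gold, thresholds):
    """For each threshold, keep answers whose agreement >= threshold.

    Returns a list of (threshold, coverage, accuracy_on_kept) tuples;
    accuracy is None when every answer is deferred.
    """
    scored = [consensus(v) for v in samples]
    curve = []
    for t in thresholds:
        kept = [(pred, g) for (pred, conf), g in zip(scored, gold) if conf >= t]
        coverage = len(kept) / len(gold)
        accuracy = sum(p == g for p, g in kept) / len(kept) if kept else None
        curve.append((t, coverage, accuracy))
    return curve


# Toy data invented for illustration only (not taken from the paper).
samples = [
    ["correct"] * 5,
    ["correct", "correct", "correct", "incorrect", "incorrect"],
    ["incorrect"] * 5,
    ["incorrect", "incorrect", "incorrect", "correct", "correct"],
]
gold = ["correct", "incorrect", "incorrect", "incorrect"]

curve = trust_curve(samples, gold, thresholds=[0.0, 0.8, 1.0])
for threshold, coverage, accuracy in curve:
    print(threshold, coverage, accuracy)
```

On this toy data, raising the threshold from 0.0 to 0.8 halves the coverage while the accuracy on the retained answers rises from 0.75 to 1.0, which is the kind of trade-off the abstract attributes to filtering low-confidence predictions.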

Related papers