Fugu-MT Paper Translation (Abstract): How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

Paper summary: How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

arxiv url: http://arxiv.org/abs/2602.16039v1
Date: Tue, 17 Feb 2026 21:46:52 GMT
Status: Translation complete
System update date: 2026-02-19 15:58:30.447896
Title: How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment
Authors: Hang Li, Kaiqi Yang, Xianxuan Long, Fedor Filippov, Yucheng Chu, Yasemin Copur-Gencturk, Peng He, Cory Miller, Namsoo Shin, Joseph Krajcik, Hui Liu, Jiliang Tang
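Abstract summary: The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. Output uncertainty is an inescapable challenge in automatic assessment. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions.
Reference score (attention metric computed independently by the site): 30.331175047465408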
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output formats, they also introduce new challenges related to output uncertainty, stemming from the inherently probabilistic nature of LLMs. Output uncertainty is an inescapable challenge in automatic assessment, as assessment results often play a critical role in informing subsequent pedagogical actions, such as providing feedback to students or guiding instructional decisions. Unreliable or poorly calibrated uncertainty estimates can lead to unstable downstream interventions, potentially disrupting students' learning processes and resulting in unintended negative consequences. To systematically understand this challenge and inform future research, we benchmark a broad range of uncertainty quantification methods in the context of LLM-based automatic assessment. Although the effectiveness of these methods has been demonstrated in many tasks across other domains, their applicability and reliability in educational settings, particularly for automatic grading, remain underexplored. Through comprehensive analyses of uncertainty behaviors across multiple assessment datasets, LLM families, and generation control settings, we characterize the uncertainty patterns exhibited by LLMs in grading scenarios. Based on these findings, we evaluate the strengths and limitations of different uncertainty metrics and analyze the influence of key factors, including model families, assessment tasks, and decoding strategies, on uncertainty estimates. Our study provides actionable insights into the characteristics of uncertainty in LLM-based automatic assessment and lays the groundwork for developing more reliable and effective uncertainty-aware grading systems in the future.
