Fugu-MT Paper Translation (Abstract): Confidence Estimation in Automatic Short Answer Grading with LLMs

Paper overview: Confidence Estimation in Automatic Short Answer Grading with LLMs

arxiv url: http://arxiv.org/abs/2605.00200v1
Date: Thu, 30 Apr 2026 20:26:10 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.742143
Title: Confidence Estimation in Automatic Short Answer Grading with LLMs
Title (reference translation): Confidence Estimation for Automatic Short Answer Grading Using LLMs
Authors: Longwei Cong, Sonja Hahn, Sebastian Gombert, Leon Camus, Hendrik Drachsler, Ulf Kroehne
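Abstract summary: Automatic short answer grading (ASAG) with generative large language models (LLMs) has shown strong performance without task-specific fine-tuning. LLM-based grading remains imperfect, so reliable confidence estimates are essential for safe and effective human-AI collaboration. This paper proposes a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty.
Reference score (independently computed attention score): 0.0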
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.
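Abstract (reference translation): Automatic short answer grading (ASAG) with generative large language models (LLMs) has demonstrated strong performance without task-specific fine-tuning while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, which makes reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. This work investigates confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies (verbalized, latent, and consistency-based) and show that model-based confidence alone cannot reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. The results show that the proposed hybrid measure yields more reliable confidence estimates and improves selective grading performance. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted educational assessment systems.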
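The abstract describes the hybrid idea only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' method. It assumes student responses have already been embedded and clustered (the `cluster_ids` input), uses normalized Shannon entropy of human grades as one possible measure of within-cluster heterogeneity, and combines it with a model confidence through a weighted average. The function names, the entropy measure, the `weight` parameter, and the `threshold` value are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def cluster_heterogeneity(cluster_ids, grades):
    """Normalized Shannon entropy of grades per cluster.

    ASSUMPTION: entropy is one possible heterogeneity measure; the abstract
    does not say which measure the paper uses.
    Returns a dict {cluster_id: value in [0, 1]}, 0 = all grades agree.
    """
    by_cluster = defaultdict(list)
    for cid, grade in zip(cluster_ids, grades):
        by_cluster[cid].append(grade)

    n_labels = len(set(grades))
    # Normalize by the maximum possible entropy over the observed label set.
    max_entropy = math.log(n_labels) if n_labels > 1 else 1.0

    result = {}
    for cid, cluster_grades in by_cluster.items():
        total = len(cluster_grades)
        counts = Counter(cluster_grades)
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        result[cid] = entropy / max_entropy
    return result


def hybrid_confidence(model_conf, heterogeneity, weight=0.5):
    """ASSUMPTION: a simple weighted average of the model's confidence and
    (1 - heterogeneity); the paper's actual combination rule is not given."""
    return (1.0 - weight) * model_conf + weight * (1.0 - heterogeneity)


def selective_grade(predictions, confidences, threshold=0.7):
    """Keep a predicted grade only when confidence reaches the threshold;
    None marks a response deferred to a human grader."""
    return [p if c >= threshold else None for p, c in zip(predictions, confidences)]


# Toy data: cluster 0 is graded consistently, cluster 1 is split evenly.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
grades = ["correct", "correct", "correct",
          "correct", "incorrect", "correct", "incorrect"]
heterogeneity = cluster_heterogeneity(cluster_ids, grades)

model_conf = 0.9  # hypothetical model-based confidence for every response
confidences = [hybrid_confidence(model_conf, heterogeneity[c]) for c in cluster_ids]
decisions = selective_grade(grades, confidences)
```

In this toy setup the model reports the same confidence everywhere, yet responses in the mixed cluster fall below the threshold and are deferred, which illustrates why a dataset-derived signal can add information that model-based confidence alone does not carry.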

Related papers