Fugu-MT Paper Translation (Summary): LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

Paper summary: LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation

arxiv url: http://arxiv.org/abs/2511.10819v2
Date: Mon, 17 Nov 2025 22:45:20 GMT
Status: Translation complete
Last updated in system: 2025-11-19 13:59:16.596148
Title: LLM-as-a-Grader: Practical Insights from Large Language Model for Short-Answer and Report Evaluation
Title (reference translation): LLM-as-a-Grader: Practical Insights from a Large Language Model for Short-Answer and Report Evaluation
Authors: Grace Byun, Swati Rajwal, Jinho D. Choi
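Abstract summary: Large language models (LLMs) are increasingly being explored for educational tasks such as grading. This study examines the feasibility of using an LLM to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course.
Reference score (independently computed attention score): 11.663970954805395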
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs) are increasingly explored for educational tasks such as grading, yet their alignment with human evaluation in real classrooms remains underexamined. In this study, we investigate the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. We collect responses from approximately 50 students across five quizzes and receive project reports from 14 teams. LLM-generated scores are compared against human evaluations conducted independently by the course teaching assistants (TAs). Our results show that GPT-4o achieves strong correlation with human graders (up to 0.98) and exact score agreement in 55% of quiz cases. For project reports, it also shows strong overall alignment with human grading, while exhibiting some variability in scoring technical, open-ended responses. We release all code and sample data to support further research on LLMs in educational assessment. This work highlights both the potential and limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.
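Abstract (reference translation): Large language models (LLMs) are being studied for educational tasks such as grading, but how well they align with human evaluation in real classrooms has not yet been sufficiently examined. This study investigates the feasibility of using an LLM (GPT-4o) to evaluate short-answer quizzes and project reports in an undergraduate Computational Linguistics course. Responses were collected from approximately 50 students across five quizzes, and project reports were received from 14 teams. The LLM-generated scores are compared with human evaluations carried out independently by the course teaching assistants (TAs). The results show that GPT-4o correlates strongly with the human graders (up to 0.98) and matches their scores exactly in 55% of quiz cases. For project reports, overall alignment with human grading is also strong, although there is some variability in the scoring of technical, open-ended responses. All code and sample data are released to support further research on LLMs in educational assessment. This work highlights both the potential and the limitations of LLM-based grading systems and contributes to advancing automated grading in real-world academic settings.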
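
The two agreement measures mentioned in the abstract (correlation with human graders and the exact score agreement rate) can be illustrated with a minimal sketch. This is not the authors' released code. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, and the function names and scores below are hypothetical, made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def exact_agreement(xs, ys):
    """Fraction of items on which both graders gave exactly the same score."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty, equal-length lists")
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


# Made-up scores for illustration only; these are NOT data from the paper.
human_scores = [5, 4, 3, 5, 2, 4]
llm_scores = [5, 4, 2, 5, 3, 4]

r = pearson(human_scores, llm_scores)
agreement = exact_agreement(human_scores, llm_scores)
print(f"Pearson r = {r:.3f}, exact agreement = {agreement:.0%}")
```

The two numbers answer different questions: correlation can be high even when the LLM is consistently off by a point, whereas exact agreement only counts identical scores, which may be why the abstract reports both.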

Related papers