Fugu-MT Paper Translation (Summary): Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

Paper summary: Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering

arxiv url: http://arxiv.org/abs/2606.12422v1
Date: Fri, 08 May 2026 16:32:43 GMT
Status: Translation complete
Last updated in system: 2026-06-15 07:09:36.883219
Title: Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering
Title (reference translation): Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering
Authors: Zewei Tian, Alex Liu, Lief Esbenshade, Michael Xiao, Zachary Zhang, Yulia Lápicus, Thomas Han, Kevin He, Min Sun
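Abstract summary: The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. This paper examines the theoretical foundations of LLM graders and evaluates a grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric.
Reference score (independently computed attention metric): 6.131107680009006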
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The integration of large language models (LLMs) into educational assessment represents a transformative shift in classroom grading practices. While automated scoring systems and machine learning techniques have existed for decades, generative AI (GenAI) now enables educators to implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations and evaluates an LLM grader that uses commercially available foundation models with context and prompt engineering to score student work against a rubric. Drawing on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, we observed the Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results demonstrate that LLM graders, especially when based on foundational models with more parameters, achieve substantial agreement with human raters in mathematics and science assessments, while the performances vary in ELA, suggesting generic foundation models can be effective at scoring in given contexts. Additional analysis of teacher and student feedback reveals strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs function most effectively as formative tools rather than summative evaluators. Our findings indicate that thoughtfully designed hybrid models that combine AI efficiency with teacher judgment can reduce workload, enhance feedback quality, and support equitable assessment practices without displacing professional expertise.
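Abstract (reference translation): The integration of large language models (LLMs) into educational assessment represents a transformative change in classroom grading practice. Automated scoring systems and machine learning techniques have existed for decades, but generative AI (GenAI) now lets educators implement standards-based grading (SBG) with unprecedented efficiency and scale. This paper examines the theoretical foundations of LLM graders and evaluates a grader that combines commercially available foundation models with context and prompt engineering to score student work against a rubric. Based on an empirical interrater agreement study using Massachusetts Comprehensive Assessment System (MCAS) data, the authors examined Quadratic Weighted Kappa (QWK) and Proportional Reduction in Mean-Squared Error (PRMSE) across mathematics, science, and ELA, using Claude Sonnet 4, Haiku 4.5, GPT-5, and GPT-5 Mini. The results show that LLM graders, especially those built on foundation models with more parameters, reach substantial agreement with human raters in mathematics and science, while performance varies in ELA, suggesting that generic foundation models can score effectively in given contexts. Further analysis of teacher and student feedback shows strong acceptance of AI-generated narrative feedback but skepticism toward numerical scores, suggesting that LLMs work best as formative tools rather than summative evaluators. The findings indicate that thoughtfully designed hybrid models combining AI efficiency with teacher judgment can reduce workload, improve feedback quality, and support equitable assessment practices without displacing professional expertise.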
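
The abstract reports interrater agreement as Quadratic Weighted Kappa (QWK) but, being an abstract, does not show how the statistic is computed. The sketch below follows the standard textbook definition of Cohen's kappa with quadratic weights; it is not the authors' code. The function name `quadratic_weighted_kappa`, its parameters, and the convention of returning 1.0 when both raters give one identical constant score are assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two raters on an integer scale.

    Hypothetical helper for illustration; follows the standard definition
    kappa = 1 - sum(w * observed) / sum(w * expected),
    with w[i][j] = (i - j)^2 / (k - 1)^2 for k score categories.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the score scale needs at least two categories")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (a, b) pairs
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters were independent.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Assumption: both raters used one identical score throughout.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up scores on a 0-4 rubric, purely for demonstration.
human_scores = [0, 1, 2, 3, 4, 2, 3]
model_scores = [0, 1, 2, 3, 3, 2, 4]
qwk = quadratic_weighted_kappa(human_scores, model_scores, 0, 4)
```

Quadratic weights penalize a disagreement by the square of its distance on the rubric, so a model that is off by one point is treated far more leniently than one that is off by three. PRMSE is not sketched here because it depends on a true-score model that the abstract does not specify.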

Related papers