Fugu-MT paper translation (abstract): Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs

Paper overview: Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs

arxiv url: http://arxiv.org/abs/2603.16357v1
Date: Tue, 17 Mar 2026 10:40:35 GMT
Status: translation complete
Last updated in system: 2026-03-18 17:42:07.225578
Title: Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs
Title (reference translation): Beyond Grading Accuracy: Exploring the Alignment of TAs and LLMs
Authors: Matthijs Jansen op de Haar, Nacir Bouali, Faizan Ahmed,
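Abstract summary: This paper investigates the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. The results show per-criterion accuracy of up to 88.56% and a Pearson correlation coefficient of up to 0.78, a substantial improvement over previous work.
Reference score (independently computed attention score): 1.529342790344802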
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we investigate the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which primarily evaluates proprietary LLMs, we focus on non-proprietary models, making our approach suitable for universities where transparency and cost are critical. Additionally, existing studies assess performance over complete diagrams rather than individual criteria, offering limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are independently evaluated by both teaching assistants (TAs) and LLMs. Grades are then compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades against assessments produced by six popular open-source LLMs. Performance is measured across individual criteria, highlighting areas where LLMs diverge from human graders. Our results show per-criterion accuracy of up to 88.56% and a Pearson correlation coefficient of up to 0.78, representing a substantial improvement over previous work while using only open-source models. We also explore the concept of an optimal model that combines the best-performing LLM per criterion. This optimal model achieves performance close to that of a TA, suggesting a possible path toward a mixed-initiative grading system. Our findings demonstrate that open-source LLMs can effectively support UML class diagram grading by explicitly identifying grading alignment. The proposed pipeline provides a practical approach to managing increasing assessment workloads with growing student counts.
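Abstract (reference translation): This paper investigates the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which mainly evaluates proprietary LLMs, we focus on non-proprietary models, which makes our approach suitable for universities where transparency and cost are critical. In addition, existing studies assess performance over complete diagrams rather than individual criteria, and so offer limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are evaluated independently by both teaching assistants (TAs) and LLMs. Grades are then compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades with the assessments produced by six popular open-source LLMs. Performance is measured per criterion, highlighting the areas where LLMs diverge from human graders. Our results show per-criterion accuracy of up to 88.56% and a Pearson correlation coefficient of up to 0.78, a substantial improvement over previous work while using only open-source models. We also explore the concept of an optimal model that combines the best-performing LLM for each criterion. This optimal model achieves performance close to that of a TA, suggesting a possible path toward a mixed-initiative grading system. Our findings indicate that open-source LLMs can effectively support UML class diagram grading by explicitly identifying grading alignment. The proposed pipeline offers a practical approach to managing the assessment workload that grows with student numbers.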
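
The abstract names three quantities: per-criterion accuracy, a Pearson correlation coefficient, and an "optimal model" that picks the best-performing LLM for each criterion. The sketch below shows one plausible way to compute them and is not the authors' implementation. It assumes binary (0/1) scores per criterion, accuracy defined as exact agreement with the TA, and a correlation computed over per-diagram total scores; the abstract confirms none of these details. The criterion names, model names, and grades are invented toy data.

```python
from math import sqrt

def per_criterion_accuracy(ta_scores, llm_scores):
    """Fraction of diagrams where the LLM score equals the TA score (assumed definition)."""
    matches = sum(1 for t, m in zip(ta_scores, llm_scores) if t == m)
    return matches / len(ta_scores)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def totals(grades):
    """Sum the criterion scores of each diagram: {criterion: [scores]} -> [totals]."""
    return [sum(column) for column in zip(*grades.values())]

def best_model_per_criterion(ta_grades, llm_grades):
    """For each criterion, pick the model whose scores agree most often with the TA."""
    return {
        criterion: max(
            llm_grades,
            key=lambda model: per_criterion_accuracy(ta, llm_grades[model][criterion]),
        )
        for criterion, ta in ta_grades.items()
    }

# Hypothetical toy data: 4 diagrams, 2 criteria, 2 models (not from the paper).
ta_grades = {"classes": [1, 1, 0, 1], "associations": [0, 1, 1, 0]}
llm_grades = {
    "model_a": {"classes": [1, 1, 0, 0], "associations": [0, 1, 1, 0]},
    "model_b": {"classes": [1, 1, 0, 1], "associations": [1, 1, 1, 0]},
}

accuracy_a = per_criterion_accuracy(ta_grades["classes"], llm_grades["model_a"]["classes"])
correlation_a = pearson(totals(ta_grades), totals(llm_grades["model_a"]))
best = best_model_per_criterion(ta_grades, llm_grades)
```

On this toy data each model is the better choice for a different criterion, which is the situation in which a per-criterion combination can outperform any single model.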

Related papers