Fugu-MT Paper Translation (Abstract): GradeLegal: Automated Grading for German Legal Cases

Paper overview: GradeLegal: Automated Grading for German Legal Cases

arxiv url: http://arxiv.org/abs/2605.21076v1
Date: Wed, 20 May 2026 12:09:49 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.656545
Title: GradeLegal: Automated Grading for German Legal Cases
Authors: Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic
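Abstract summary: Grading German legal exam solutions faces growing volumes and a shortage of qualified graders. Despite this practical relevance, the literature lacks systematic studies on effective methods for grading legal exams. We investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law.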
Reference score (independently computed attention metric): 3.376444850947719
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
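The agreement figures in the abstract (0.91 and 0.60) are quadratic weighted kappa values. As a rough illustration of how that metric is computed, here is a minimal sketch in plain Python. It is not the authors' evaluation code: the function name, the 0 to 18 point scale, and the example grades are all assumptions made for illustration.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, min_grade, max_grade):
    """Cohen's kappa with quadratic weights for two lists of integer grades."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n_grades = max_grade - min_grade + 1
    if n_grades < 2:
        raise ValueError("the grade scale needs at least two levels")
    for grade in list(rater_a) + list(rater_b):
        if not min_grade <= grade <= max_grade:
            raise ValueError(f"grade {grade} is outside the scale")

    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of (grade_a, grade_b)
    hist_a = Counter(rater_a)                  # marginal counts for rater A
    hist_b = Counter(rater_b)                  # marginal counts for rater B

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_grade, max_grade + 1):
        for j in range(min_grade, max_grade + 1):
            # Quadratic penalty: 0 on the diagonal, 1 at the largest possible gap.
            weight = (i - j) ** 2 / (n_grades - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            # Expected disagreement if the two raters graded independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)

    if weighted_expected == 0.0:
        # Both raters gave one identical grade to every item; treating this
        # as perfect agreement is a convention assumed here.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Made-up grades on an assumed 0-18 point scale (not data from the paper).
expert_grades = [4, 7, 9, 12, 6]
model_grades = [5, 7, 8, 12, 9]
print(round(quadratic_weighted_kappa(expert_grades, model_grades, 0, 18), 3))
```

Because the penalty grows with the square of the gap between two grades, a model that is off by one point is penalized far less than one that is off by five, which suits ordinal grading scales.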

Related papers