Fugu-MT Paper Translation (Abstract): RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

Paper overview: RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows

arxiv url: http://arxiv.org/abs/2510.09021v1
Date: Fri, 10 Oct 2025 05:47:40 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:48.193631
Title: RefGrader: Automated Grading of Mathematical Competition Proofs using Agentic Workflows
Authors: Hamed Mahdavi, Pouria Mahdavinia, Samira Malek, Pegah Mohammadipour, Alireza Hashemi, Majid Daliri, Alireza Farhadi, Amir Khasahmadi, Niloofar Mireshghallah, Vasant Honavar
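Abstract summary: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems. This paper studies how well such models grade proofs, using a corpus of 90 Gemini 2.5 Pro-generated solutions graded on a 1-4 scale with detailed error annotations. The analysis shows that models can reliably flag incorrect solutions but exhibit calibration gaps in how partial credit is assigned.
Reference score (attention score computed by this site): 8.700422995850152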
License: http://creativecommons.org/licenses/by/4.0/
Abstract: State-of-the-art (SOTA) LLMs have progressed from struggling on proof-based Olympiad problems to solving most of the IMO 2025 problems, with leading systems reportedly handling 5 of 6 problems. Given this progress, we assess how well these models can grade proofs: detecting errors, judging their severity, and assigning fair scores beyond binary correctness. We study proof-analysis capabilities using a corpus of 90 Gemini 2.5 Pro-generated solutions that we grade on a 1-4 scale with detailed error annotations, and on MathArena solution sets for IMO/USAMO 2025 scored on a 0-7 scale. Our analysis shows that models can reliably flag incorrect (including subtly incorrect) solutions but exhibit calibration gaps in how partial credit is assigned. To address this, we introduce agentic workflows that extract and analyze reference solutions and automatically derive problem-specific rubrics for a multi-step grading process. We instantiate and compare different design choices for the grading workflows, and evaluate their trade-offs. Across our annotated corpus and MathArena, our proposed workflows achieve higher agreement with human grades and more consistent handling of partial credit across metrics. We release all code, data, and prompts/logs to facilitate future research.

Related papers