Fugu-MT Paper Translation (Abstract): Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

Paper summary: Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

arxiv url: http://arxiv.org/abs/2603.25633v1
Date: Thu, 26 Mar 2026 16:43:54 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.385474
Title: Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?
Title (reference translation): Is Mathematical Problem-Solving Expertise in Large Language Models Linked to Assessment Performance?
Authors: Liang Zhang, Yu Fu, Xinyi Jin
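Abstract summary: It is unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH.
Reference score (independently computed attention measure): 8.840705133076877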
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH, a human-annotated benchmark for identifying the earliest erroneous step in mathematical reasoning. We evaluate two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, in two independent tasks on the same math problems: solving the original problem and assessing a benchmark-provided solution by predicting the earliest erroneous step. Results show a consistent within-model pattern: assessment accuracy is substantially higher on math problem items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment remains more difficult than direct problem solving, especially on error-present solutions. These findings suggest that math problem-solving expertise supports stronger assessment performance, but reliable step-level diagnosis also requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.
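Abstract (reference translation): Large language models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it is unclear whether stronger math problem-solving ability is associated with stronger step-level assessment performance. This study examines that relationship using the GSM8K and MATH subsets of PROCESSBENCH. Two LLM-based math tutor agent settings, instantiated with GPT-4 and GPT-5, are evaluated on two independent tasks over the same math problems. Assessment accuracy is higher on items the same model solved correctly than on items it solved incorrectly, with statistically significant associations across both models and datasets. At the same time, assessment is more difficult than direct problem solving, especially on error-present solutions. These results suggest that math problem-solving expertise supports better assessment performance, but reliable step-level diagnosis requires additional capabilities such as step tracking, monitoring, and precise error localization. The results have implications for the design and evaluation of AI-supported Adaptive Instructional Systems (AISs) for formative assessment in math education.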
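
The within-model comparison described in the abstract (assessment accuracy on items the model solved correctly versus items it solved incorrectly) can be illustrated with a small sketch. Everything below is an assumption for illustration only: the counts are made up, the function names are hypothetical, and the Pearson chi-square test on a 2x2 table is just one plausible way to check for an association. The abstract does not say which statistical test the authors used.

```python
from math import erfc, sqrt


def conditional_accuracy(records):
    """Build a 2x2 table from (solved_correctly, assessed_correctly) pairs.

    Returns the table, indexed as table[solved][assessed], and the
    assessment accuracy for each value of solved_correctly.
    """
    table = [[0, 0], [0, 0]]
    for solved, assessed in records:
        table[int(solved)][int(assessed)] += 1
    accuracy = {}
    for solved in (0, 1):
        n = sum(table[solved])
        accuracy[bool(solved)] = table[solved][1] / n if n else float("nan")
    return table, accuracy


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("every row and column needs at least one item")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical per-item outcomes for one model on one dataset (NOT paper data).
records = (
    [(True, True)] * 40      # solved correctly, assessed correctly
    + [(True, False)] * 10   # solved correctly, assessed incorrectly
    + [(False, True)] * 10   # solved incorrectly, assessed correctly
    + [(False, False)] * 40  # solved incorrectly, assessed incorrectly
)

table, accuracy = conditional_accuracy(records)
chi2, p_value = chi_square_2x2(table)
print("assessment accuracy when solved correctly:", accuracy[True])
print("assessment accuracy when solved incorrectly:", accuracy[False])
print("chi-square:", chi2, "p-value:", p_value)
```

In this made-up sample the gap between the two conditional accuracies is what a "within-model pattern" would look like; a real analysis would run the same comparison per model and per dataset on the actual benchmark items.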
