Fugu-MT Paper Translation (Abstract): MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Paper overview: MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

arxiv url: http://arxiv.org/abs/2606.13782v2
Date: Mon, 15 Jun 2026 05:26:47 GMT
Status: Translation complete
Last updated in system: 2026-06-16 13:45:31.22235
Title: MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
Title (reference translation): MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
Authors: Lushi Pu, Weiming Zhang, Xinheng Xie, Zixuan Fu, Bingxiang He, Hongya Lyu, Xin Li, Jie Zhou, Yudong Wang
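Abstract summary: MA-ProofBench is, to the authors' knowledge, the first formal theorem-proving benchmark dedicated to mathematical analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The authors evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench.
Reference score (attention score computed independently by this site): 28.906916840252077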
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have made notable progress in automated theorem proving, yet existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most are concentrated in areas that are easier to formalize, such as algebra and elementary number theory, and provide limited coverage of subfields that require deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to Mathematical Analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are divided into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs perform formal reasoning at different mathematical depths. Each problem is constructed through a human-led, LLM-assisted formalization pipeline followed by independent expert review, ensuring that the formal statements remain faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. However, most models perform poorly: even the best-performing model, GPT-5.5, achieves only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, while an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.
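Abstract (reference translation): Large language models (LLMs) have made notable progress in automated theorem proving, but existing formal benchmarks remain limited in both mathematical coverage and difficulty. Most concentrate on areas that are easier to formalize, such as algebra and elementary number theory, and offer only limited coverage of subfields that demand deeper reasoning, including mathematical analysis. To address this gap, we introduce MA-ProofBench, which is, to the best of our knowledge, the first formal theorem-proving benchmark dedicated to mathematical analysis. The benchmark contains 200 formalized theorems covering 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis. The problems are split into two difficulty levels, an undergraduate level (Level I, 100 problems) and a Ph.D. qualifying level (Level II, 100 problems), to evaluate how well LLMs carry out formal reasoning at different mathematical depths. Each problem is built through a human-led, LLM-assisted formalization pipeline followed by independent expert review, which ensures that the formal statements stay faithful to the original mathematics. We evaluate a range of recent general-purpose reasoning models and formal theorem provers on MA-ProofBench. Most models perform poorly: even the best-performing model, GPT-5.5, reaches only 16% Pass@8 on Level I and 5% on Level II, while most models stay close to 0% on Level II. Further analysis identifies Mathlib hallucinations and incomplete proofs as the two dominant failure modes, and an evaluation on the natural-language version of the benchmark exposes a clear gap between informal and formal reasoning. MA-ProofBench is intended to serve as a reliable reference for tracking progress in formal mathematical reasoning in advanced domains.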
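
The abstract reports results as Pass@8 but does not say how the metric is computed. The sketch below assumes the common reading: a problem counts as solved if at least one of the first k proof attempts is accepted by the proof checker, and Pass@k is the fraction of solved problems. The function name `pass_at_k`, the problem identifiers, and the attempt data are all hypothetical illustrations, not taken from the paper.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of problems with at least one accepted proof among the first k attempts.

    `results` maps a problem id to a list of booleans, one per attempt,
    where True means the proof checker accepted that attempt.
    """
    if not results:
        raise ValueError("results must contain at least one problem")
    solved = sum(1 for attempts in results.values() if any(attempts[:k]))
    return solved / len(results)


# Hypothetical data: 100 Level I problems with 8 attempts each.
# The first 16 problems have exactly one accepted attempt; the rest have none.
level_one = {
    f"problem_{i:03d}": [(j == 0 and i < 16) for j in range(8)]
    for i in range(100)
}

print(f"Pass@8 = {pass_at_k(level_one, k=8):.0%}")
```

Under this assumed definition, 16% Pass@8 on the 100 Level I problems corresponds to 16 problems with at least one accepted proof, and 5% on Level II corresponds to 5 problems.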

Related papers