Fugu-MT 論文翻訳(概要): FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

論文の概要: FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

arxiv url: http://arxiv.org/abs/2603.26996v1
Date: Fri, 27 Mar 2026 21:14:53 GMT
ステータス: 翻訳完了
システム内更新日: 2026-03-31 23:18:44.726576
Title: FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?
Title（参考訳）: FormalProofBench: モデルでは、形式的に検証されたレベル数学の証明を書けるか?
Authors: Nikil Ravi, Kexing Ying, Vasilii Nesterov, Rayan Krishnan, Elif Uskuplu, Bingyu Xia, Janitha Aswedige, Langston Nashold,
Abstract要約: 本稿では、AIモデルが正式に検証された数学的証明を生成できるかどうかを評価するために設計されたプライベートベンチマークであるFormalProofBenchを紹介する。各タスクは、自然言語とLean4の形式的なステートメントをペアリングし、モデルはLean 4チェッカーが受け入れるリーンの証明を出力する必要があります。エージェントハーネスを用いたフロンティアモデルの評価を行い,最も優れた基礎モデルが33.5%の精度で達成できることを見出した。
参考スコア（独自算出の注目度）: 0.04324039930655099
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present FormalProofBench, a private benchmark designed to evaluate whether AI models can produce formally verified mathematical proofs at the graduate level. Each task pairs a natural-language problem with a Lean~4 formal statement, and a model must output a Lean proof accepted by the Lean 4 checker. FormalProofBench targets advanced undergraduate and graduate mathematics, with problems drawn from qualifying exams and standard textbooks across topics including analysis, algebra, probability, and logic. We evaluate a range of frontier models with an agentic harness, and find that the best-performing foundation model achieves 33.5% accuracy, with performance dropping rapidly after that. In addition to the accuracy numbers, we also provide empirical analysis of tool-use, failure modes, cost and latency, thereby providing a thorough evaluation of the formal-theorem proving abilities of frontier models.
Abstract（参考訳）: 我々は、AIモデルが卒業レベルで正式に検証された数学的証明を作成できるかどうかを評価するために設計されたプライベートベンチマークであるFormalProofBenchを提案する。各タスクは、Lean～4の形式的なステートメントと自然言語の問題をペアリングし、モデルはLean 4チェッカーが受け入れるリーンの証明を出力する必要があります。 FormalProofBenchは、高度な学部と大学院の数学を対象とし、分析、代数、確率、論理などを含む分野にわたる資格試験と標準教科書から引き出された問題がある。エージェントハーネスを用いたフロンティアモデルの評価を行い、最適性能の基礎モデルが33.5%の精度で達成され、その後性能は急速に低下した。また, ツール使用, 障害モード, コスト, レイテンシの実験的解析を行い, フロンティアモデルの形式的理論実証能力を徹底的に評価する。

論文の概要: FormalProofBench: Can Models Write Graduate Level Math Proofs That Are Formally Verified?

関連論文リスト