Fugu-MT Paper Translation (Abstract): FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Paper abstract: FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

arxiv url: http://arxiv.org/abs/2511.02872v1
Date: Tue, 04 Nov 2025 03:25:17 GMT
Status: Translation complete
System update date: 2025-11-06 18:19:32.183448
Title: FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
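Title (reference translation): FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels
Authors: Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailing Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong
Abstract summary: FATE (Formal Algebra Theorem Evaluation) is a new benchmark series in formal algebra. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library.
Reference score (independently computed attention score): 7.8395206631845324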
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.
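Abstract (reference translation): Recent advances in large language models (LLMs) have shown impressive capabilities in formal theorem proving, especially on contest-based mathematical benchmarks such as the IMO. These contests, however, do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra. We present two new components, FATE-H and FATE-X, each containing 100 problems in abstract and commutative algebra. The FATE series spans a range of difficulty from undergraduate exercises to problems beyond PhD qualifying exams. In particular, FATE-X is the first formal benchmark to exceed both PhD-level exam difficulty and the coverage of the Mathlib library. The best model reaches only 3% accuracy (pass@64) on FATE-H and 0% on FATE-X. Our two-stage evaluation shows that the models' natural-language reasoning is notably more accurate than their ability to formalize that reasoning. We systematically classify the common errors that arise during formalization. A comparative study further shows that a specialized prover can exhibit less effective reflection than general-purpose models, which lowers its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.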
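
The abstract reports accuracy as pass@64 but does not say how the metric is computed. As a rough illustration only, the sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation; whether the FATE authors use this estimator is an assumption. The function names and the per-problem counts are hypothetical, chosen so the result mirrors the 3% figure, and are not data from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n sampled proofs, c of them verified.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random size-k
    subset of the n samples contains at least one verified proof.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem pass@k estimates over a whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts (NOT from the paper): 100 problems, 64 attempts each,
# and only three problems with at least one verified proof.
counts = [1, 5, 64] + [0] * 97
score = benchmark_pass_at_k(counts, n=64, k=64)
print(f"pass@64 = {score:.2%}")
```

When n equals k, as in this example, the estimator reduces to the fraction of problems with at least one verified proof among the 64 attempts.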

Related papers