Fugu-MT Paper Translation (Summary): APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Paper overview: APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

arxiv url: http://arxiv.org/abs/2505.05758v2
Date: Mon, 12 May 2025 08:03:49 GMT
Status: Translation complete
System update date: 2025-05-13 14:13:13.058094
Title: APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
Authors: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
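Abstract summary: APOLLO is a model-agnostic pipeline that combines the strengths of the Lean compiler with the reasoning abilities of LLMs. On the miniF2F benchmark, it establishes a new state-of-the-art accuracy of 75.0% among 7B-parameter models.
Reference score (independently computed attention measure): 8.056359341994941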
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 75.0% among 7B-parameter models while keeping the sampling budget below one thousand. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.
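The repair loop described in the abstract (verify, isolate failing sub-goals, try an automated solver, fall back to the LLM, recombine, re-verify, repeat up to a maximum number of attempts) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `repair_loop`, `toy_verify`, `toy_auto`, and `toy_llm` are hypothetical names, and the "verifier" is a lookup table standing in for the Lean compiler.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins; none of these names come from the paper.
Verifier = Callable[[str, str], bool]    # (goal, proof) -> does the proof check?
Prover = Callable[[str], Optional[str]]  # goal -> candidate proof, or None


def repair_loop(
    goals: List[str],
    draft: Dict[str, str],
    verify: Verifier,
    auto_solver: Prover,
    llm_prover: Prover,
    max_attempts: int = 3,
) -> Optional[Dict[str, str]]:
    """Return a checked proof for every goal, or None if the budget runs out."""
    proofs = dict(draft)
    for _ in range(max_attempts):
        # Ask the verifier which sub-goals currently fail.
        failing = [g for g in goals if not verify(g, proofs.get(g, ""))]
        if not failing:
            return proofs  # every sub-proof checks
        for goal in failing:
            # Try the cheap automated solver first.
            candidate = auto_solver(goal)
            if candidate is None or not verify(goal, candidate):
                # Fall back to the (stubbed) LLM for this goal only.
                candidate = llm_prover(goal)
            if candidate is not None:
                proofs[goal] = candidate
    # Re-verify the recombined proof after the last repair round.
    if all(verify(g, proofs.get(g, "")) for g in goals):
        return proofs
    return None


# Toy setup: a lookup table plays the role of the Lean checker (assumption).
VALID = {"a + 0 = a": "simp", "a * b = b * a": "ring", "x ^ 2 >= 0": "positivity"}


def toy_verify(goal: str, proof: str) -> bool:
    return VALID.get(goal) == proof


def toy_auto(goal: str) -> Optional[str]:
    return "simp"  # the automated solver only knows one tactic


def toy_llm(goal: str) -> Optional[str]:
    return VALID.get(goal)  # an idealized LLM that always finds the proof


draft = {g: "sorry" for g in VALID}  # the initial draft fails everywhere
result = repair_loop(list(VALID), draft, toy_verify, toy_auto, toy_llm)
print(result)
```

In this toy run the automated solver closes only the first goal and the stubbed LLM handles the other two, so the model is called per failing goal rather than once per whole proof. That per-goal targeting is the property the abstract credits for the low sampling budget.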

Related papers