Fugu-MT 論文翻訳(概要): Benchmarking Testing in Automated Theorem Proving

論文の概要: Benchmarking Testing in Automated Theorem Proving

arxiv url: http://arxiv.org/abs/2604.23698v1
Date: Sun, 26 Apr 2026 13:24:20 GMT
ステータス: 翻訳完了
システム内更新日: 2026-04-28 17:12:07.504721
Title: Benchmarking Testing in Automated Theorem Proving
Title（参考訳）: 自動定理証明におけるベンチマークテスト
Authors: Jongyoon Kim, Hojae Han, Seung-won Hwang,
Abstract要約: T は形式定理の意味的正しさを評価する枠組みである。 5つの実世界のLean 4リポジトリからベンチマークを構築します。実験により、最先端のモデルでは高いコンパイル成功を達成できるが、セマンティック・メトリックでは著しく性能が低下することが示された。
参考スコア（独自算出の注目度）: 39.65133452374143
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have shown promise in formal theorem proving, yet evaluating semantic correctness remains challenging. Existing evaluations rely on indirect proxies such as lexical overlap with human-annotated proof, or expensive manual inspection. Inspired by the shift from lexical comparison to test-based evaluation in code generation, we propose T , a framework that evaluates the semantic correctness of formal theorems: a generated theorem is considered correct only if all dependent successor theorems compile successfully, analogous to integration testing. We construct a benchmark from 5 real-world Lean 4 repositories, comprising 2,206 problems paired with 41 successor theorems on average, automatically extracted without human effort. Experiments demonstrate that while state-of-the-art models achieve high compilation success, they perform significantly worse under our semantic metric. The best model, Claude-Sonnet-4.5, achieves only 38.9% Testing Accuracy on the full set, given both natural language proof and successor theorems as context, revealing a critical gap in current theorem generation capabilities.
Abstract（参考訳）: 大規模言語モデル(LLM)の最近の進歩は、形式的定理の証明において有望であることを示しているが、意味的正当性を評価することは依然として困難である。既存の評価は、人間による注釈付き証明と語彙的重複や高価な手動検査のような間接的プロキシに依存している。コード生成における語彙比較からテストベース評価へのシフトに着想を得て,形式定理の意味的正当性を評価するフレームワークTを提案する。我々は5つの実世界のLean 4リポジトリからベンチマークを構築し、平均41の後継者定理と組み合わせた2,206の問題を人間の努力なしに自動的に抽出した。実験により、最先端のモデルでは高いコンパイル成功を達成できるが、セマンティック・メトリックでは著しく性能が低下することが示された。最良のモデルであるClaude-Sonnet-4.5は、自然言語の証明と後継定理の両方を文脈として、完全集合上でのテスト精度を38.9%しか達成していない。

論文の概要: Benchmarking Testing in Automated Theorem Proving

関連論文リスト