Related papers: Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

URL: http://arxiv.org/abs/2508.15878v1
Date: Thu, 21 Aug 2025 14:15:40 GMT
Title: Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs
Authors: Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, Mrinmaya Sachan,
Abstract summary: We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems.<n>We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning.<n>Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges.
Score: 41.29431283264807
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5\% success on Busy Beaver problems, it manages only 12\% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.

Related papers

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch [8.80477323574638]
We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving.<n>IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads.
arXiv Detail & Related papers (2025-11-30T17:40:13Z)
Transformers Provably Learn Chain-of-Thought Reasoning with Length Generalization [53.89723291716722]
A crucial question about AI reasoning is whether models can extrapolate learned reasoning patterns to solve harder tasks with longer chain-of-thought (CoT)<n>We mathematically prove how the algebraic structure of state-tracking problems governs the degree of extrapolation of the learned CoT.<n>We provide the first optimization guarantee that constant-depth transformers provably learn $mathsfNC1$-complete problems with CoT.
arXiv Detail & Related papers (2025-11-10T18:40:24Z)
Towards Solving More Challenging IMO Problems via Decoupled Reasoning and Proving [48.22540519786074]
Recent studies show that the informal accuracy exceeds 80% while formal success remains below 8% on benchmarks like PutnamBench.<n>We propose a novel framework that decouples high-level reasoning from low-level proof generation.<n>We evaluate our method on a challenging set of post-2000 IMO problems, a problem set on which no prior open-source prover has reported success.
arXiv Detail & Related papers (2025-07-07T22:38:49Z)
Enumerate-Conjecture-Prove: Formally Solving Answer-Construction Problems in Math Competitions [37.10426226729792]
We introduce the LLMe-Conjecture-Prove (ECP) framework, a modular neuro-symbolic method integrating pattern-driven conjecturing with formal theorem proving.<n>We present ConstructiveBench, a dataset of 3,431 answer-Thought problems in various math competitions with verified Lean formalizations.
arXiv Detail & Related papers (2025-05-24T03:52:25Z)
Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving [44.5694120006447]
We present a principled formulation of problem-solving as a deterministic Markov decision process.<n>We also present a novel framework, FPS, which utilizes existing FTP environments to perform process-verified problem-solving.<n>We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench.
arXiv Detail & Related papers (2025-05-07T16:02:14Z)
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models [17.919212265668783]
We present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems.<n>We introduce a novel human-in-the-loop autoformalization pipeline that integrates statement autoformalization, semantic verification, and negation-based disproof filtering strategies.<n>Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations.
arXiv Detail & Related papers (2025-05-05T15:37:00Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.<n>The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.<n>Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
Artifical intelligence and inherent mathematical difficulty [0.0]
We first present an updated version of a traditional argument that limitative results from computability and complexity theory show that proof discovery is an inherently difficult problem. We then illustrate how several recent applications of artificial intelligence-inspired methods do indeed raise novel questions about the nature of mathematical proof.
arXiv Detail & Related papers (2024-08-01T20:08:31Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression [127.68780714438103]
Two main geometry problems: calculation and proving, are usually treated as two specific tasks. We construct a large-scale Unified Geometry problem benchmark, UniGeo, which contains 4,998 calculation problems and 9,543 proving problems. We also present a unified multi-task Geometric Transformer framework, Geoformer, to tackle calculation and proving problems simultaneously.
arXiv Detail & Related papers (2022-12-06T04:37:51Z)
From Checking to Inference: Actual Causality Computations as Optimization Problems [79.87179017975235]
We present a novel approach to formulate different notions of causal reasoning, over binary acyclic models, as optimization problems. We show that both notions are efficiently automated. Using models with more than $8000$ variables, checking is computed in a matter of seconds, with MaxSAT outperforming ILP in many cases.
arXiv Detail & Related papers (2020-06-05T10:56:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.