Related papers: StepFun-Prover Preview: Let's Think and Verify Step by Step

StepFun-Prover Preview: Let's Think and Verify Step by Step

URL: http://arxiv.org/abs/2507.20199v3
Date: Wed, 13 Aug 2025 10:58:50 GMT
Title: StepFun-Prover Preview: Let's Think and Verify Step by Step
Authors: Shijie Shang, Ruosi Wan, Yue Peng, Yutong Wu, Xiong-hui Chen, Jie Yan, Xiangyu Zhang,
Abstract summary: We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning.<n>Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback.<n>On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of $70.0%$.
Score: 14.896796588073725
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of $70.0\%$. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistant.

Related papers

PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning [30.44007644340425]
We introduce PROPA, a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations.<n>Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines.<n>It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art.
arXiv Detail & Related papers (2025-11-13T13:06:12Z)
SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning [38.4246156415702]
Kimina-Prover is a large language model that pioneers a novel reasoning-driven exploration paradigm for formal theorem proving.<n>Trained with a large-scale reinforcement learning pipeline from Qwen2.5-72B, Kimina-Prover demonstrates strong performance in Lean 4 proof generation.<n>Kimina-Prover sets a new state-of-the-art on the miniF2F benchmark, reaching 80.7% with pass@8192.
arXiv Detail & Related papers (2025-04-15T16:23:44Z)
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [86.32257216965229]
We propose a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding.<n>StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR)<n>With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning.
arXiv Detail & Related papers (2025-03-17T08:51:44Z)
LeanProgress: Guiding Search for Neural Theorem Proving via Proof Progress Prediction [74.79306773878955]
We introduce LeanProgress, a method that predicts the progress in the proof.<n>Our experiments show that LeanProgress achieves an overall prediction accuracy of 75.1%.
arXiv Detail & Related papers (2025-02-25T07:46:36Z)
AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence [29.551802573731305]
We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word.<n>We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
arXiv Detail & Related papers (2025-02-19T18:35:55Z)
BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.<n>We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.<n>We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis [50.020850767257095]
We propose ProofAug, a procedure that equips LLMs with automation methods at various granularities.<n>Our method is validated on the miniF2F benchmark using the open-source deep-math-7b-base model and the Isabelle proof assistant.<n>We also implement a Lean 4 version of ProofAug that can improve the pass@1 performance of Kimina-Prover-seek-Distill-1.5B from 44.3% to 50.4%.
arXiv Detail & Related papers (2025-01-30T12:37:06Z)
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.<n>Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z)
Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization [4.374023944113174]
We introduce Prom, an open-source library to enhance the robustness and performance of predictive models.<n>Prom achieves this by using statistical assessments to identify test samples prone to mispredictions.<n>Our evaluation demonstrates that Prom can successfully identify an average of 96% (up to 100%) of mispredictions.
arXiv Detail & Related papers (2024-12-31T06:17:03Z)
InternLM2.5-StepProver: Advancing Automated Theorem Proving via Critic-Guided Search [65.05674971652776]
A prevalent proof method involves the prover iteratively constructing the proof tactic by tactic, typically following a best-first search scheme.<n>We propose an intuitive yet effective method, which utilizes a critic model to capture the preference information.<n>A large-scale expert iteration with more than 20,000 CPU days is then applied to further fine-tune the prover and the critic.
arXiv Detail & Related papers (2024-10-21T07:18:23Z)
Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.<n>Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase. We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.