Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
- URL: http://arxiv.org/abs/2511.13043v2
- Date: Tue, 18 Nov 2025 11:35:16 GMT
- Title: Spark-Prover-X1: Formal Theorem Proving Through Diverse Data Training
- Authors: Xinyuan Zhou, Yi Lei, Xiaoyu Zhou, Jingyi Sun, Yu Zhu, Zhongyi Ye, Weitai Zhang, Quan Liu, Si Wei, Cong Liu,
- Abstract summary: We introduce Spark-Prover-X1, a 7B parameter model trained via a three-stage framework to unlock the reasoning potential of moderately-sized LLMs.<n>Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning.<n>We also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems.
- Score: 24.74642687780046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have shown significant promise in automated theorem proving, yet progress is often constrained by the scarcity of diverse and high-quality formal language data. To address this issue, we introduce Spark-Prover-X1, a 7B parameter model trained via an three-stage framework designed to unlock the reasoning potential of more accessible and moderately-sized LLMs. The first stage infuses deep knowledge through continuous pre-training on a broad mathematical corpus, enhanced by a suite of novel data tasks. Key innovation is a "CoT-augmented state prediction" task to achieve fine-grained reasoning. The second stage employs Supervised Fine-tuning (SFT) within an expert iteration loop to specialize both the Spark-Prover-X1-7B and Spark-Formalizer-X1-7B models. Finally, a targeted round of Group Relative Policy Optimization (GRPO) is applied to sharpen the prover's capabilities on the most challenging problems. To facilitate robust evaluation, particularly on problems from real-world examinations, we also introduce ExamFormal-Bench, a new benchmark dataset of 402 formal problems. Experimental results demonstrate that Spark-Prover achieves state-of-the-art performance among similarly-sized open-source models within the "Whole-Proof Generation" paradigm. It shows exceptional performance on difficult competition benchmarks, notably solving 27 problems on PutnamBench (pass@32) and achieving 24.0\% on CombiBench (pass@32). Our work validates that this diverse training data and progressively refined training pipeline provides an effective path for enhancing the formal reasoning capabilities of lightweight LLMs. Both Spark-Prover-X1-7B and Spark-Formalizer-X1-7B, along with the ExamFormal-Bench dataset, are made publicly available at: https://www.modelscope.cn/organization/iflytek, https://gitcode.com/ifly_opensource.
Related papers
- Lifecycle-Aware code generation: Leveraging Software Engineering Phases in LLMs [12.70863561286374]
We introduce a lifecycle-aware framework that incorporates intermediate artifacts into both the training and inference stages.<n> Experiments show that lifecycle-level fine-tuning improves code correctness by up to 75% over the same model before fine-tuning.<n>Open-source LLMs, once fine-tuned under our framework, match or slightly outperform models pretrained on code.
arXiv Detail & Related papers (2025-10-28T02:54:02Z) - Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling [37.237020102873]
Reward models are essential for aligning Large Language Models with human values, yet their development is hampered by costly preference datasets and poor interpretability.<n>We build a training-free framework that infers high-quality, query-specific rubrics using a validation-guided textbfPropose-Evaluate-Revise pipeline.<n>Using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts.
arXiv Detail & Related papers (2025-10-20T09:01:37Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - SynthCoder: A Synthetical Strategy to Tune LLMs for Code Completion [7.668823606571788]
Code completion is a prominent application of Large Language Models (LLMs) in software engineering.<n>This paper proposes SynthCoder, a model that integrates leading industry practices to achieve state-of-the-art on the Fill-in-the-Middle (FIM) code completion task.
arXiv Detail & Related papers (2025-08-21T12:23:49Z) - Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction [95.91743732150233]
Goedel-Prover-V2, a series of open-source language models, set a new state-of-the-art in automated theorem proving.<n>We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems.<n>Goedel-Prover-V2-32B achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode.
arXiv Detail & Related papers (2025-08-05T16:28:22Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass.<n>with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation.<n>We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP)<n>On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $sim$16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL)<n>Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - Tuning the Right Foundation Models is What you Need for Partial Label Learning [55.61644150441799]
Partial label learning seeks to train generalizable classifiers from datasets with inexact supervision.<n>In this work, we empirically conduct evaluations of 11 foundation models across 13 approaches on 8 benchmark datasets under 3 scenarios.<n>Our findings reveal that current approaches tend to achieve significant performance gains when using foundation models, 2) exhibit remarkably similar performance to each other, 3) maintain stable performance across varying ambiguity levels, while 4) are susceptible to foundation model selection and adaptation strategies.
arXiv Detail & Related papers (2025-06-05T13:37:33Z) - FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models [19.984973014373118]
We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models.<n>FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs.
arXiv Detail & Related papers (2025-05-26T17:06:25Z) - Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5 2025) performance in automated formal proof generation for mathematical problems.<n>First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4.<n>Next, we develop a large dataset of formal proofs by training a series of provers.<n>Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving [33.61458249318183]
Self-play Theorem Prover (STP) takes on two roles, conjecturer and prover.<n>STP simultaneously takes on two roles, conjecturer and prover.<n>We evaluate with both Lean and Isabelle formal versifiers.
arXiv Detail & Related papers (2025-01-31T23:01:48Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Towards General and Efficient Online Tuning for Spark [55.30868031221838]
We present a general and efficient Spark tuning framework that can deal with the three issues simultaneously.
We have implemented this framework as an independent cloud service, and applied it to the data platform in Tencent.
arXiv Detail & Related papers (2023-09-05T02:16:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.