TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
- URL: http://arxiv.org/abs/2601.17332v1
- Date: Sat, 24 Jan 2026 06:28:11 GMT
- Title: TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
- Authors: Yicheng Tao, Hongteng Xu,
- Abstract summary: We introduce textbfTheoremForge, a cost-effective formal data synthesis pipeline.<n>Our strategy increases data yield by textbf1.6$times$ for proof generation compared to standard filtering.<n>Results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models.
- Score: 32.80602352741069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce \textbf{TheoremForge}, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks, which are \textit{statement formalization}, \textit{proof generation}, \textit{premise selection}, \textit{proof correction} and \textit{proof sketching}. By implementing a \textit{Decoupled Extraction Strategy}, the workflow recovers valid training signals from globally failed trajectories, effectively utilizing wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6\%, surpassing the 8.6\% baseline, at an average cost of only \textbf{\$0.481} per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by \textbf{1.6$\times$} for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available \href{https://github.com/timechess/TheoremForge}{here}.
Related papers
- LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks [4.6880826836662814]
We introduce textbfLOGIGEN, a logic-driven framework that synthesizes verifiable training data.<n>On $2$-Bench, LOGIGEN-32B(RL) achieves a textbf79.5% success rate, substantially outperforming the base model.
arXiv Detail & Related papers (2026-02-28T08:35:30Z) - Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
textscTaesar is a emphdata-centric framework for textbftarget-textbfal textbfregeneration.<n>It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z) - The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee [5.345468714252351]
This work bridges the gap by developing an LLM-Verifier Convergence Theorem.<n>We model the interaction between the LLM and the verifier as a discrete-time Markov Chain.<n>We stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials.
arXiv Detail & Related papers (2025-11-30T22:19:09Z) - Generative Myopia: Why Diffusion Models Fail at Structure [0.6768558752130311]
Graph Diffusion Models (GDMs) optimize for statistical likelihood, implicitly acting as textbffrequency filtersGenerative Myopia.<n>We prove theoretically and empirically that this failure is driven by textbfGradient Starvation.<n>We introduce textbfSpectrally-Weighted Diffusion, which re-aligns the variational objective using Effective Resistance.
arXiv Detail & Related papers (2025-11-23T19:27:13Z) - DAG-Math: Graph-Guided Mathematical Reasoning in LLMs [54.231935013127206]
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT)<n>We propose modeling CoT as a certain rule-based process over directed acyclic graphs (DAGs)<n>We introduce logical closeness, a metric that quantifies how well a model's CoT trajectory adheres to the DAG structure.
arXiv Detail & Related papers (2025-10-19T21:05:17Z) - Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem [2.9458822941489387]
E-prover's saturation capabilities on the vast TPTP axiom library are used to derive a massive, guaranteed-valid corpus of theorems.<n>Our pipeline is principled and simple: axiom saturates, filter for "interesting" theorems, and generate tasks.<n>This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction.
arXiv Detail & Related papers (2025-09-08T15:43:29Z) - FormaRL: Enhancing Autoformalization with no Labeled Data [5.607770363395876]
We propose textbfFormaRL, a simple yet efficient reinforcement learning framework for autoformalization.<n>FormaRL integrates syntax check from Lean compiler and consistency check from large language model to calculate the reward.<n>Experiments show that FormaRL can increase the pass@1 autoformalization accuracy of Qwen2.5-Coder-7B-Instruct by 4 $sim$ 6x.
arXiv Detail & Related papers (2025-08-26T10:38:18Z) - Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning [19.624245500772027]
A robust distributed learning algorithm aims to maintain reliable performance despite the presence of misbehaving workers.<n>Such misbehaviors are commonly modeled as $textitByzantine failures$, allowing arbitrarily corrupted communication, or as $textitdata poisoning$, a weaker form of corruption restricted to local training data.<n>We show for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning.
arXiv Detail & Related papers (2025-06-22T12:59:15Z) - TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces textbfTD3, a novel textbfDataset textbfDistillation method within a meta-learning framework.<n> TD3 distills a fully expressive emphsynthetic sequence summary from original data.<n>An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the emphouter-loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z) - Rectified Diffusion Guidance for Conditional Generation [94.83538269086613]
We revisit the theory behind CFG and rigorously confirm that the improper combination coefficients (textiti.e.) brings about expectation shift the generative distribution.<n>We show that our approach enjoys a textbftextitform solution given the strength.<n> Empirical evidence on real-world data demonstrate the compatibility of our design with existing state-of-the-art diffusion models.
arXiv Detail & Related papers (2024-10-24T13:41:32Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - SMaRt: Improving GANs with Score Matching Regularity [114.43433222721025]
Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex.<n>We find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold.<n>We show that our approach can consistently boost the performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function.
arXiv Detail & Related papers (2023-11-30T03:05:14Z) - Proof Artifact Co-training for Theorem Proving with Language Models [4.934817254755007]
PACT (bf Proof bf Artifact bf Co-bf Training) is a general methodology for extracting self-supervised data from kernel-level proof terms for co-training.
We instrument Lean with a neural theorem prover driven by a Transformer language model and show that PACT improves theorem proving success rate on a held-out suite of test theorems from 32% to 48%.
arXiv Detail & Related papers (2021-02-11T18:59:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.