Related papers: TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow

TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow

URL: http://arxiv.org/abs/2601.17332v1
Date: Sat, 24 Jan 2026 06:28:11 GMT
Title: TheoremForge: Scaling up Formal Data Synthesis with Low-Budget Agentic Workflow
Authors: Yicheng Tao, Hongteng Xu,
Abstract summary: We introduce textbfTheoremForge, a cost-effective formal data synthesis pipeline.<n>Our strategy increases data yield by textbf1.6$times$ for proof generation compared to standard filtering.<n>Results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models.
Score: 32.80602352741069
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The high cost of agentic workflows in formal mathematics hinders large-scale data synthesis, exacerbating the scarcity of open-source corpora. To address this, we introduce \textbf{TheoremForge}, a cost-effective formal data synthesis pipeline that decomposes the formalization process into five sub-tasks, which are \textit{statement formalization}, \textit{proof generation}, \textit{premise selection}, \textit{proof correction} and \textit{proof sketching}. By implementing a \textit{Decoupled Extraction Strategy}, the workflow recovers valid training signals from globally failed trajectories, effectively utilizing wasted computation. Experiments on a 2,000-problem benchmark demonstrate that TheoremForge achieves a Verified Rate of 12.6\%, surpassing the 8.6\% baseline, at an average cost of only \textbf{\$0.481} per successful trajectory using Gemini-3-Flash. Crucially, our strategy increases data yield by \textbf{1.6$\times$} for proof generation compared to standard filtering. These results establish TheoremForge as a scalable framework for constructing a data flywheel to train future expert models. Our code is available \href{https://github.com/timechess/TheoremForge}{here}.

Related papers

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks [4.6880826836662814]
We introduce textbfLOGIGEN, a logic-driven framework that synthesizes verifiable training data.<n>On $2$-Bench, LOGIGEN-32B(RL) achieves a textbf79.5% success rate, substantially outperforming the base model.
arXiv Detail & Related papers (2026-02-28T08:35:30Z)
Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
textscTaesar is a emphdata-centric framework for textbftarget-textbfal textbfregeneration.<n>It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z)
The 4/$δ$ Bound: Designing Predictable LLM-Verifier Systems for Formal Method Guarantee [5.345468714252351]
This work bridges the gap by developing an LLM-Verifier Convergence Theorem.<n>We model the interaction between the LLM and the verifier as a discrete-time Markov Chain.<n>We stress-tested this prediction in an extensive empirical campaign comprising more than 90,000 trials.
arXiv Detail & Related papers (2025-11-30T22:19:09Z)
Generative Myopia: Why Diffusion Models Fail at Structure [0.6768558752130311]
Graph Diffusion Models (GDMs) optimize for statistical likelihood, implicitly acting as textbffrequency filtersGenerative Myopia.<n>We prove theoretically and empirically that this failure is driven by textbfGradient Starvation.<n>We introduce textbfSpectrally-Weighted Diffusion, which re-aligns the variational objective using Effective Resistance.
arXiv Detail & Related papers (2025-11-23T19:27:13Z)
DAG-Math: Graph-Guided Mathematical Reasoning in LLMs [54.231935013127206]
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT)<n>We propose modeling CoT as a certain rule-based process over directed acyclic graphs (DAGs)<n>We introduce logical closeness, a metric that quantifies how well a model's CoT trajectory adheres to the DAG structure.
arXiv Detail & Related papers (2025-10-19T21:05:17Z)
Saturation-Driven Dataset Generation for LLM Mathematical Reasoning in the TPTP Ecosystem [2.9458822941489387]
E-prover's saturation capabilities on the vast TPTP axiom library are used to derive a massive, guaranteed-valid corpus of theorems.<n>Our pipeline is principled and simple: axiom saturates, filter for "interesting" theorems, and generate tasks.<n>This purely symbolic data is then transformed into three difficulty-controlled challenges: entailment verification, premise selection, and proof reconstruction.
arXiv Detail & Related papers (2025-09-08T15:43:29Z)
FormaRL: Enhancing Autoformalization with no Labeled Data [5.607770363395876]
We propose textbfFormaRL, a simple yet efficient reinforcement learning framework for autoformalization.<n>FormaRL integrates syntax check from Lean compiler and consistency check from large language model to calculate the reward.<n>Experiments show that FormaRL can increase the pass@1 autoformalization accuracy of Qwen2.5-Coder-7B-Instruct by 4 $sim$ 6x.
arXiv Detail & Related papers (2025-08-26T10:38:18Z)
Byzantine Failures Harm the Generalization of Robust Distributed Learning Algorithms More Than Data Poisoning [19.624245500772027]
A robust distributed learning algorithm aims to maintain reliable performance despite the presence of misbehaving workers.<n>Such misbehaviors are commonly modeled as $textitByzantine failures$, allowing arbitrarily corrupted communication, or as $textitdata poisoning$, a weaker form of corruption restricted to local training data.<n>We show for the first time, a fundamental gap in generalization guarantees between the two threat models: Byzantine failures yield strictly worse rates than those achievable under data poisoning.
arXiv Detail & Related papers (2025-06-22T12:59:15Z)
TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces textbfTD3, a novel textbfDataset textbfDistillation method within a meta-learning framework.<n> TD3 distills a fully expressive emphsynthetic sequence summary from original data.<n>An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the emphouter-loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z)
Rectified Diffusion Guidance for Conditional Generation [94.83538269086613]
We revisit the theory behind CFG and rigorously confirm that the improper combination coefficients (textiti.e.) brings about expectation shift the generative distribution.<n>We show that our approach enjoys a textbftextitform solution given the strength.<n> Empirical evidence on real-world data demonstrate the compatibility of our design with existing state-of-the-art diffusion models.
arXiv Detail & Related papers (2024-10-24T13:41:32Z)
Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z)
MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity. We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points. We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
SMaRt: Improving GANs with Score Matching Regularity [114.43433222721025]
Generative adversarial networks (GANs) usually struggle in learning from highly diverse data, whose underlying manifold is complex.<n>We find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold.<n>We show that our approach can consistently boost the performance of various state-of-the-art GANs on real-world datasets with pre-trained diffusion models acting as the approximate score function.
arXiv Detail & Related papers (2023-11-30T03:05:14Z)
Proof Artifact Co-training for Theorem Proving with Language Models [4.934817254755007]
PACT (bf Proof bf Artifact bf Co-bf Training) is a general methodology for extracting self-supervised data from kernel-level proof terms for co-training. We instrument Lean with a neural theorem prover driven by a Transformer language model and show that PACT improves theorem proving success rate on a held-out suite of test theorems from 32% to 48%.
arXiv Detail & Related papers (2021-02-11T18:59:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.