TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition
- URL: http://arxiv.org/abs/2510.11944v1
- Date: Mon, 13 Oct 2025 21:09:36 GMT
- Title: TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition
- Authors: Yupei Li, Philipp Borchert, Gerasimos Lampouras
- Abstract summary: We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks.
- Score: 9.593530910909363
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.
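To make the described decomposition more concrete, the sketch below shows one hypothetical way to split a piece of code into the three components the abstract names (docstring, main function, dependency functions) and reassemble them into an informal/formal-style pair. This is an illustrative assumption only: the helper names (`decompose`, `reassemble`), the use of Python's `ast` module, and the exact shape of the output pair are not taken from the paper.

```python
# Hypothetical sketch of a TopoAlign-style decomposition; not the authors' implementation.
import ast
import textwrap

def decompose(source: str, main_name: str):
    """Split a module's source into (docstring, main function, dependency functions)."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    main = funcs[main_name]
    doc = ast.get_docstring(main) or ""
    # Dependencies: other top-level functions that the main function calls.
    called = {
        node.func.id
        for node in ast.walk(main)
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
    }
    deps = [funcs[name] for name in called if name in funcs and name != main_name]
    return doc, main, deps

def reassemble(doc: str, main: ast.FunctionDef, deps: list) -> dict:
    """Pair an informal-style description with a formal-statement-like analogue.

    The docstring plays the role of the informal statement; the dependency
    signatures act as premises and the main signature as the goal.
    """
    def signature(fn: ast.FunctionDef) -> str:
        args = ", ".join(a.arg for a in fn.args.args)
        return f"{fn.name}({args})"

    premises = "\n".join(signature(d) for d in deps)
    formal = (premises + "\n" if premises else "") + "⊢ " + signature(main)
    return {"informal": textwrap.dedent(doc).strip(), "formal": formal}

if __name__ == "__main__":
    code = '''
def gcd(a, b):
    return a if b == 0 else gcd(b, a % b)

def coprime(a, b):
    """Return True iff a and b share no common factor greater than 1."""
    return gcd(a, b) == 1
'''
    doc, main, deps = decompose(code, "coprime")
    print(reassemble(doc, main, deps))
```

Under these assumptions, the resulting {"informal": ..., "formal": ...} pairs would structurally resemble informal/formal statement pairs and could be used as additional training data without human annotation, which is the alignment idea the abstract describes.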
Related papers
- MathlibLemma: Folklore Lemma Generation and Benchmark for Formal Mathematics [51.65118519899584]
We introduce MathlibLemma, the first LLM-based multi-agent system to automate the discovery and formalization of mathematical folklore lemmas. We further construct the MathlibLemma benchmark, a suite of 4,028 type-checked Lean statements spanning a broad range of mathematical domains.
arXiv Detail & Related papers (2026-01-30T15:24:42Z) - DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems [14.568293842955065]
DRIFT is a framework that decomposes informal mathematical statements into smaller, more tractable "sub-components". It retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval.
arXiv Detail & Related papers (2025-10-12T21:42:04Z) - Automated Formalization via Conceptual Retrieval-Augmented LLMs [19.328918823576153]
CRAMF is a Concept-driven Retrieval-Augmented Mathematical Formalization framework. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers.
arXiv Detail & Related papers (2025-08-09T10:54:25Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [78.1575956773948]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable. A promising yet largely uncharted alternative is formal language-based reasoning. Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Neural Theorem Proving: Generating and Structuring Proofs for Formal Verification [0.26763498831034044]
We introduce a framework that generates whole proofs in a formal language to be used within systems that utilize the power of built-in tactics and off-the-shelf automated theorem provers. To train the LLM, we employ a 2-stage finetuning process, where we first use SFT-based training to enable the model to generate syntactically correct Isabelle code. We validate our framework using the miniF2F-test benchmark and the Isabelle proof assistant and design a use case to verify the correctness of the S3 bucket access policy code.
arXiv Detail & Related papers (2025-04-23T18:04:38Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving [72.8626512877667]
We introduce Goedel-Prover, an open-source language model that achieves state-of-the-art (as of April 5 2025) performance in automated formal proof generation for mathematical problems. First, we train LLMs to convert natural language math problems from the Numina dataset to equivalent formal statements in Lean 4. Next, we develop a large dataset of formal proofs by training a series of provers. Finally, we obtain the dataset Goedel-Pset-v1-solved, which contains proofs for over 800K statements from Goedel-Pset-v1.
arXiv Detail & Related papers (2025-02-11T15:27:35Z) - TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts [26.98890165420689]
TheoremLlama is an end-to-end framework that trains a general-purpose LLM to be a Lean4 expert.
Our framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively.
arXiv Detail & Related papers (2024-07-03T15:36:18Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization [45.439933713342256]
Large language models (LLMs) are becoming increasingly capable of solving mathematical quantitative reasoning problems.
We leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics, they can be prompted to translate informal mathematical statements into formal Isabelle code.
This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement.
arXiv Detail & Related papers (2024-03-26T22:01:13Z) - InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLM InternLM-Math, which is continually pre-trained from InternLM2.
We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z)