DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
- URL: http://arxiv.org/abs/2510.10815v1
- Date: Sun, 12 Oct 2025 21:42:04 GMT
- Title: DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems
- Authors: Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras
- Abstract summary: DRIFT is a framework that decomposes informal mathematical statements into smaller, more tractable "sub-components". It retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval.
- Score: 14.568293842955065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable "sub-components". This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model's capabilities.
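The Decompose-Retrieve-Illustrate-Formalize pipeline described in the abstract can be sketched as follows. This is a minimal, illustrative sketch only: the function names, the toy premise library, and the keyword-overlap retriever are assumptions standing in for the paper's LLM-driven decomposition and dense retrieval over Mathlib, not DRIFT's actual implementation.

```python
# Hypothetical sketch of the DRIFT pipeline: Decompose -> Retrieve ->
# Illustrate -> Formalize. All names and data below are illustrative.

def decompose(informal_statement: str) -> list[str]:
    """Split an informal statement into smaller sub-components.
    A real system would prompt an LLM; this toy heuristic splits on commas."""
    return [part.strip() for part in informal_statement.split(",") if part.strip()]

def retrieve_premises(sub_components: list[str], library: dict[str, str]) -> list[str]:
    """Retrieve candidate premises per sub-component by keyword overlap,
    a stand-in for dense retrieval over a Mathlib premise index."""
    hits = set()
    for sub in sub_components:
        words = set(sub.lower().split())
        for name, doc in library.items():
            if words & set(doc.lower().split()):
                hits.add(name)
    return sorted(hits)

def formalize(statement: str, premises: list[str]) -> str:
    """Assemble a prompt combining the statement with retrieved premises;
    a real system would send this to an LLM to emit a Lean statement."""
    return f"-- premises: {', '.join(premises)}\n-- informal: {statement}"

# Toy stand-in for a Mathlib premise index (names are real Mathlib lemmas,
# but the docstrings here are simplified paraphrases).
library = {
    "Nat.add_comm": "addition of natural numbers is commutative",
    "Nat.mul_comm": "multiplication of natural numbers is commutative",
}

stmt = "for natural numbers, addition is commutative"
subs = decompose(stmt)
prompt = formalize(stmt, retrieve_premises(subs, library))
print(prompt)
```

The key design point mirrored here is that retrieval operates on the decomposed sub-components rather than the full informal statement, so each query carries a narrower, more focused signal.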
Related papers
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study this gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20 points.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
- DAG-Math: Graph-Guided Mathematical Reasoning in LLMs [54.231935013127206]
Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT). We propose modeling CoT as a certain rule-based process over directed acyclic graphs (DAGs). We introduce logical closeness, a metric that quantifies how well a model's CoT trajectory adheres to the DAG structure.
arXiv Detail & Related papers (2025-10-19T21:05:17Z)
- TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition [9.593530910909363]
We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks.
arXiv Detail & Related papers (2025-10-13T21:09:36Z)
- Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Existing formalizers still struggle to consistently generate valid statements that meet syntactic validity and semantic consistency. We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z)
- Automated Formalization via Conceptual Retrieval-Augmented LLMs [19.328918823576153]
CRAMF is a Concept-driven Retrieval-Augmented Mathematical Formalization framework. We introduce a framework for automatically constructing a concept-definition knowledge base from Mathlib4. Experiments on miniF2F, ProofNet, and our newly proposed AdvancedMath benchmark show that CRAMF can be seamlessly integrated into LLM-based autoformalizers.
arXiv Detail & Related papers (2025-08-09T10:54:25Z)
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL). We propose SPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z)
- Solving Formal Math Problems by Decomposition and Iterative Reflection [30.54275542622631]
Delta Prover orchestrates the interaction between a general-purpose LLM and the Lean 4 proof environment. Delta Prover achieves a state-of-the-art 95.9% success rate on the miniF2F-test benchmark.
arXiv Detail & Related papers (2025-07-21T03:56:35Z)
- FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models [17.919212265668783]
We present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems. We introduce a novel human-in-the-loop autoformalization pipeline that integrates statement autoformalization, semantic verification, and negation-based disproof filtering strategies. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations.
arXiv Detail & Related papers (2025-05-05T15:37:00Z)
- Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization [29.06255449960557]
We introduce a novel method that leverages data extracted from Mathlib to train a lightweight and effective premise retrieval model. The model is learned in a contrastive learning framework, in which a fine-grained similarity calculation method and a re-ranking module are applied. Experimental results demonstrate that our model outperforms existing baselines, achieving higher accuracy while maintaining a lower computational load.
arXiv Detail & Related papers (2025-01-21T06:32:25Z)
- TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models [68.65075559137608]
We propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas.
We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system.
We develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability.
arXiv Detail & Related papers (2023-10-16T08:42:39Z)
- LeanDojo: Theorem Proving with Retrieval-Augmented Language Models [72.54339382005732]
Large language models (LLMs) have shown promise in proving formal theorems using proof assistants such as Lean.
Existing methods are difficult to reproduce or build on, due to private code, data, and compute requirements.
This paper introduces LeanDojo: an open-source Lean platform consisting of toolkits, data, and models.
We develop ReProver: an LLM-based prover augmented with retrieval for selecting premises from a vast math library.
arXiv Detail & Related papers (2023-06-27T17:05:32Z)
- Proof Artifact Co-training for Theorem Proving with Language Models [4.934817254755007]
PACT (Proof Artifact Co-Training) is a general methodology for extracting self-supervised data from kernel-level proof terms for co-training.
We instrument Lean with a neural theorem prover driven by a Transformer language model and show that PACT improves theorem proving success rate on a held-out suite of test theorems from 32% to 48%.
arXiv Detail & Related papers (2021-02-11T18:59:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.