Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph
- URL: http://arxiv.org/abs/2510.04520v1
- Date: Mon, 06 Oct 2025 06:25:11 GMT
- Title: Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph
- Authors: Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, Bin Dong
- Abstract summary: We present Aria, a system for conjecture-level formalization in Lean. Aria emulates human expert reasoning via a two-phase Graph-of-Thought process. AriaScorer retrieves definitions from Mathlib for term-level grounding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.
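The abstract's two-phase process — recursively decompose a statement into a dependency graph of concepts, then formalize from grounded leaves upward — can be sketched with a topological ordering. This is a minimal conceptual illustration, not Aria's actual pipeline; the concept names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph for one conjecture: each concept maps to
# the concepts its definition depends on (names are illustrative only).
deps = {
    "conjecture": {"noetherian_ring"},
    "noetherian_ring": {"ring", "ascending_chain_condition"},
    "ascending_chain_condition": {"partial_order"},
    "ring": set(),
    "partial_order": set(),
}

# Phase two formalizes leaf concepts first (grounding each in Mathlib
# where possible) and builds upward; a topological order of the
# dependency graph yields exactly that schedule.
order = list(TopologicalSorter(deps).static_order())
print(order)  # leaves such as "ring" appear before "conjecture"
```

In such a scheme, each node would be either matched against an existing Mathlib definition or synthesized as a new definition before any concept that depends on it is attempted.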
Related papers
- Can We Predict Before Executing Machine Learning Agents? [74.39460101251792]
We formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report. We instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%.
arXiv Detail & Related papers (2026-01-09T16:44:17Z) - Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms [14.853721511192736]
Large language models (LLMs) have led to breakthroughs in automated mathematical reasoning and scientific discovery. We present a benchmark of four frontier models: GPT-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4. We find that while the top-tier models achieve a high accuracy rate, other models lag significantly in consistency.
arXiv Detail & Related papers (2025-12-16T00:34:55Z) - Reliable Fine-Grained Evaluation of Natural Language Math Proofs [30.992321135182905]
We propose a systematic methodology for developing evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. We introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method.
arXiv Detail & Related papers (2025-10-14T02:59:07Z) - Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields. This makes it a demanding frontier for large language models (LLMs). We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z) - REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning [12.343823629674368]
We present REAL-Prover, a new open-source stepwise theorem prover for Lean 4. Our prover notably boosts performance on solving college-level mathematics problems. In experiments, our prover, trained with supervised fine-tuning only, achieves competitive results with a 23.7% success rate.
arXiv Detail & Related papers (2025-05-27T01:26:11Z) - LIMO: Less is More for Reasoning [23.312893016642096]
We demonstrate that sophisticated mathematical reasoning can emerge with only a few examples. Our model, LIMO, achieves 63.3% accuracy on AIME24 and 95.6% on MATH500. LIMO exhibits strong out-of-distribution generalization, achieving a 45.8% absolute improvement across diverse benchmarks.
arXiv Detail & Related papers (2025-02-05T17:23:45Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation. For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it. As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language. We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet. We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z) - DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [65.5290035371111]
We introduce an approach to generate extensive Lean 4 proof data derived from high-school and undergraduate-level mathematical competition problems.
We fine-tune the DeepSeekMath 7B model on this synthetic dataset, which comprises 8 million formal statements with proofs.
Our model successfully proved 5 out of 148 problems in the Lean 4 Formalized International Mathematical Olympiad (FIMO) benchmark, while GPT-4 failed to prove any.
arXiv Detail & Related papers (2024-05-23T09:03:42Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions.
We construct our benchmark by extending 20 existing datasets, collecting task instructions and solutions in the form of Python programs.
We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
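The type-check filtering with self-consistency mentioned in the related work above admits a compact sketch: sample several candidate formalizations, discard those that fail to type-check, and majority-vote over the survivors. This is an illustrative toy, not the paper's implementation; `type_checks` is a hypothetical stand-in for invoking the Lean elaborator:

```python
from collections import Counter

def select_formalization(candidates, type_checks):
    """Type-check filtering with self-consistency: drop candidates that
    fail to type-check, then return the most frequent survivor."""
    survivors = [c for c in candidates if type_checks(c)]
    if not survivors:
        return None
    return Counter(survivors).most_common(1)[0][0]

# Toy run with a stand-in checker; a real system would run each
# candidate formalization through Lean and keep only those that compile.
samples = ["thm_v1", "thm_v2", "thm_v1", "broken", "thm_v1"]
best = select_formalization(samples, lambda s: s != "broken")
print(best)  # thm_v1
```

The filtering step removes candidates that cannot even elaborate, and the frequency vote exploits the observation that independently sampled correct formalizations tend to agree more often than incorrect ones.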
This list is automatically generated from the titles and abstracts of the papers in this site.