Conjecturing: An Overlooked Step in Formal Mathematical Reasoning
- URL: http://arxiv.org/abs/2510.11986v1
- Date: Mon, 13 Oct 2025 22:29:49 GMT
- Title: Conjecturing: An Overlooked Step in Formal Mathematical Reasoning
- Authors: Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, Gerasimos Lampouras,
- Abstract summary: Many mathematical problems cannot be formalised directly without first conjecturing a conclusion.<n>We develop ConjectureBench to measure the conjecturing capabilities of Large Language Models.<n>We show that our method achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1.
- Score: 10.398167568549924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.
Related papers
- Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification [49.506412445511934]
Large Language Models (LLMs) show remarkable capabilities, yet their next-token prediction creates logical inconsistencies and reward hacking.<n>We introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process.<n>We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization.
arXiv Detail & Related papers (2026-01-30T07:01:25Z) - ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process.<n>This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors.<n>Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z) - Autoformalizer with Tool Feedback [52.334957386319864]
Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements.<n>Existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency.<n>We propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process.
arXiv Detail & Related papers (2025-10-08T10:25:12Z) - FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory [44.64175433092553]
Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving.<n>Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored.<n>We introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning.
arXiv Detail & Related papers (2025-09-26T14:40:14Z) - Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks.<n>Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning.<n>This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z) - Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny [78.1575956773948]
Large Language Models (LLMs) trained with Reinforcement Learning (RL) face a significant challenge: their verification processes are neither reliable nor scalable.<n>A promising yet largely uncharted alternative is formal language-based reasoning.<n>Grounding LLMs in rigorous formal systems where generative models operate in formal language spaces (e.g., Dafny) enables the automatic and mathematically provable verification of their reasoning processes and outcomes.
arXiv Detail & Related papers (2025-07-22T08:13:01Z) - Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning [8.135142928659546]
We introduce a systematic, automatic method to evaluate autoformalization tasks.<n>The proposed method is based on an ensemble of judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ)<n>Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation.
arXiv Detail & Related papers (2025-06-12T17:09:51Z) - CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions.<n>We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z) - FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models [17.919212265668783]
We present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems.<n>We introduce a novel human-in-the-loop autoformalization pipeline that integrates statement autoformalization, semantic verification, and negation-based disproof filtering strategies.<n>Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations.
arXiv Detail & Related papers (2025-05-05T15:37:00Z) - Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language.<n>We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet.<n>We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z) - Autoformalization with Large Language Models [22.86710743804944]
A successful autoformalization system could advance the fields of formal verification, program synthesis, and artificial intelligence.
We show large language models provide new prospects towards this goal.
Our methodology results in a new state-of-the-art result on the MiniF2F theorem proving benchmark, improving the proof rate from $29.6%$ to $35.2%$.
arXiv Detail & Related papers (2022-05-25T09:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.