Related papers: LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4

LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4

URL: http://arxiv.org/abs/2507.14722v1
Date: Sat, 19 Jul 2025 18:50:07 GMT
Title: LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4
Authors: Matěj Kripner, Michal Šustr, Milan Straka,
Abstract summary: We introduce LeanTree, a tool built in the Lean 4 language that factorizes complex proof states into simpler, independent branches.<n>Our preliminary results hint that white-box approaches outperform black-box alternatives in some settings.
Score: 0.9619839248195652
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (LLMs) have recently emerged as a promising heuristic for ATP, but they lack correctness guarantees and thus require interaction with a proof verifier. Such interactions typically follow one of two approaches: black-box interaction, which does not utilize intermediate proof states, or white-box approaches, which allow for incremental proof construction and examination of intermediate states. While black-box approaches have directly benefited from recent LLM advances, white-box methods have comparatively lagged behind. In this paper, we address this gap by introducing LeanTree, which consists of (i) a tool built in the Lean 4 language that factorizes complex proof states into simpler, independent branches, and (ii) a dataset of these factorized intermediate states. Our white-box tooling offers several advantages over black-box approaches: it simplifies evaluation, reduces necessary context, generates richer training data, enables parallel search across multiple states, supports efficient reuse of states, and provides feedback in case of errors. Our preliminary results hint that white-box approaches outperform black-box alternatives in some settings.

Related papers

Prism: Efficient Test-Time Scaling via Hierarchical Search and Self-Verification for Discrete Diffusion Language Models [96.0074341403456]
Inference-time compute has re-emerged as a practical way to improve LLM reasoning.<n>Most test-time scaling (TTS) algorithms rely on autoregressive decoding.<n>We propose Prism, an efficient TTS framework for dLLMs.
arXiv Detail & Related papers (2026-02-02T09:14:51Z)
BRIDGE: Building Representations In Domain Guided Program Verification [67.36686119518441]
BRIDGE decomposes verification into three interconnected domains: Code, Specifications, and Proofs.<n>We show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods.
arXiv Detail & Related papers (2025-11-26T06:39:19Z)
HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs [32.234133057592935]
Hermes is a tool-assisted agent that interleaves informal reasoning with verified proof steps in Lean systems.<n>We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales.
arXiv Detail & Related papers (2025-11-24T04:50:18Z)
ProofBridge: Auto-Formalization of Natural Language Proofs in Lean via Joint Embeddings [9.764411884491052]
We present ProofBridge, a framework for automatically translating entire NL theorems and proofs into Lean 4.<n>At its core is a joint embedding model that aligns NL and FL (NL-FL) theorem-proof pairs in a shared semantic space.<n>Our training ensures NL-FL theorems are mapped close together in this space if and only if the NL-FL pairs are semantically equivalent.
arXiv Detail & Related papers (2025-10-17T14:20:50Z)
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations.<n>ClUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction [30.76377830825308]
Existing tasks fall short in evaluating reasoning ability of Large Language Models (LLMs) in an interactive, unknown environment.<n>This deficiency leads to the isolated assessment of deductive, inductive, and abductive reasoning.<n>We introduce a novel evaluation paradigm, textitblack-box interaction, to tackle this challenge.
arXiv Detail & Related papers (2025-08-26T13:54:17Z)
Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields.<n>This makes it a demanding frontier for large language models (LLMs)<n>We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
Auditing Black-Box LLM APIs with a Rank-Based Uniformity Test [24.393978712663618]
API providers may discreetly serve quantized or fine-tuned variants to reduce costs or maliciously alter model behaviors.<n>We propose a rank-based uniformity test that can verify the behavioral equality of a black-box LLM to a locally deployed authentic model.<n>We evaluate the approach across diverse threat scenarios, including quantization, harmful fine-tuning, jailbreak prompts, and full model substitution.
arXiv Detail & Related papers (2025-06-08T03:00:31Z)
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization [69.29207684569695]
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs)<n>Existing approaches are focused on text CoT, limiting their ability to leverage visual cues.<n>In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization.
arXiv Detail & Related papers (2025-04-25T14:48:18Z)
Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs [60.881609323604685]
Large Language Models (LLMs) accessed via black-box APIs introduce a trust challenge.<n>Users pay for services based on advertised model capabilities.<n> providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs.<n>This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking.
arXiv Detail & Related papers (2025-04-07T03:57:41Z)
MAFT: Efficient Model-Agnostic Fairness Testing for Deep Neural Networks via Zero-Order Gradient Search [20.48306648223519]
We propose a novel black-box individual fairness testing method called Model-Agnostic Fairness Testing (MAFT)<n>We demonstrate that MAFT achieves the same effectiveness as state-of-the-art white-box methods whilst improving the applicability to large-scale networks.
arXiv Detail & Related papers (2024-12-28T09:07:06Z)
A Probabilistic Framework for LLM Hallucination Detection via Belief Tree Propagation [72.93327642336078]
We propose Belief Tree Propagation (BTProp), a probabilistic framework for hallucination detection.<n>BTProp introduces a belief tree of logically related statements by decomposing a parent statement into child statements.<n>Our method improves baselines by 3%-9% (evaluated by AUROC and AUC-PR) on multiple hallucination detection benchmarks.
arXiv Detail & Related papers (2024-06-11T05:21:37Z)
Bisimulation Learning [55.859538562698496]
We compute finite bisimulations of state transition systems with large, possibly infinite state space. Our technique yields faster verification results than alternative state-of-the-art tools in practice.
arXiv Detail & Related papers (2024-05-24T17:11:27Z)
Theoretically Achieving Continuous Representation of Oriented Bounding Boxes [64.15627958879053]
This paper endeavors to completely solve the issue of discontinuity in Oriented Bounding Box representation. We propose a novel representation method called Continuous OBB (COBB) which can be readily integrated into existing detectors. For fairness and transparency of experiments, we have developed a modularized benchmark based on the open-source deep learning framework Jittor's detection toolbox JDet for OOD evaluation.
arXiv Detail & Related papers (2024-02-29T09:27:40Z)
Ten Words Only Still Help: Improving Black-Box AI-Generated Text Detection via Proxy-Guided Efficient Re-Sampling [19.780068724002888]
POGER is a proxy-guided efficient re-sampling method for black-box AIGT detection. It outperforms all baselines in macro F1 under black-box, partial white-box, and out-of-distribution settings.
arXiv Detail & Related papers (2024-02-14T14:32:16Z)
Language Models as Black-Box Optimizers for Vision-Language Models [62.80817942316398]
Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. We aim to develop a black-box approach to optimize VLMs through natural language prompts.
arXiv Detail & Related papers (2023-09-12T04:03:41Z)
PromptBoosting: Black-Box Text Classification with Ten Forward Passes [61.38341243907045]
We describe PromptBoosting, a query-efficient procedure for building a text classifier from a neural language model (LM) without access to the LM's parameters, gradients, or hidden representations. Experiments show that PromptBoosting achieves state-of-the-art performance in multiple black-box few-shot classification tasks, and matches or outperforms full fine-tuning in both few-shot and standard learning paradigms, while training 10x faster than existing black-box methods.
arXiv Detail & Related papers (2022-12-19T06:04:54Z)
Automating Reasoning with Standpoint Logic via Nested Sequents [0.0]
Standpoint logic advocates a multi-perspective approach permitting reasoning with a selection of diverse and possibly conflicting standpoints. We show how to automate standpoint reasoning by means of non-deterministic proof-search algorithms.
arXiv Detail & Related papers (2022-05-05T16:27:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.