Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs
- URL: http://arxiv.org/abs/2507.04719v1
- Date: Mon, 07 Jul 2025 07:27:45 GMT
- Title: Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs
- Authors: Roozbeh Yousefzadeh, Xuenan Cao
- Abstract summary: We take the position that open code, open data, and benchmarks that are complete and error-free will accelerate progress in this field. We identify practices that create barriers to contributing to this field and suggest ways to remove them.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This position paper provides a critical but constructive discussion of current benchmarking and evaluation practices in the field of formal reasoning and automated theorem proving. We take the position that open code, open data, and benchmarks that are complete and error-free will accelerate progress in this field. We identify practices that create barriers to contributing to this field and suggest ways to remove them. We also discuss some of the practices that might produce misleading evaluative information. We aim to create discussions that bring together people from various groups contributing to automated theorem proving, autoformalization, and informal reasoning.
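To make concrete what a complete benchmark entry might look like, here is a minimal illustrative sketch, not taken from the paper: a single problem pairing an informal statement and informal proof with a formal statement and formal proof in Lean 4. The theorem, its name, and the Mathlib dependency are assumptions chosen only for illustration.

```lean
import Mathlib.Tactic

-- Hypothetical benchmark entry (illustrative only, not from the paper).
-- Informal statement: the sum of two even natural numbers is even.
-- Informal proof: write a = 2 * m and b = 2 * n; then a + b = 2 * (m + n), which is even.

-- Formal statement and formal proof of the same fact:
theorem even_add_even (a b : Nat) (ha : ∃ m, a = 2 * m) (hb : ∃ n, b = 2 * n) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by omega⟩
```

Shipping all four components together, rather than only formal statements or only formal proofs, is the kind of completeness the title and abstract advocate.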
Related papers
- Beyond Agreement: Rethinking Ground Truth in Educational AI Annotation [1.8434042562191815]
We argue that overreliance on human inter-rater reliability (IRR) as a gatekeeper for annotation quality hampers progress in classifying data. We highlight five examples of complementary evaluation methods, such as multi-label annotation schemes, expert-based approaches, and close-the-loop validity. We call on the field to rethink annotation quality and ground truth, prioritizing validity and educational impact over consensus alone.
arXiv Detail & Related papers (2025-07-31T20:05:26Z) - Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning [8.135142928659546]
We introduce a systematic, automatic method to evaluate autoformalization tasks. The proposed method is based on an ensemble of judges, defined on criteria encompassing logical preservation (LP), mathematical consistency (MC), formal validity (FV), and formal quality (FQ). Overall, our experiments demonstrate that the EFG ensemble of LLM judges is a suitable emerging proxy for evaluation (a minimal sketch of such an ensemble appears after this list).
arXiv Detail & Related papers (2025-06-12T17:09:51Z) - Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof. Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Evaluating LLM-driven User-Intent Formalization for Verification-Aware Languages [6.0608817611709735]
We propose a metric for evaluating the quality of specifications for verification-aware languages.
We show that our metric agrees closely with human labels on a dataset of Dafny specifications for the popular MBPP code-generation benchmark.
We also outline formal verification challenges that need to be addressed to apply the technique more widely.
arXiv Detail & Related papers (2024-06-14T06:52:08Z) - Improving Autoformalization using Type Checking [15.58948808529849]
We analyze both current autoformalization methods and the processes used to evaluate them, focusing specifically on the Lean 4 theorem proving language. We demonstrate that scaling type-check filtering with self-consistency techniques on top of existing methods significantly improves performance, achieving absolute accuracy gains of up to +18.4% on ProofNet (a minimal sketch of this filtering idea appears after this list). We also release new benchmarks: a new research-level mathematics dataset RLM25, a corrected ProofNet, and ProofNetVerif with labeled correct and incorrect autoformalization pairs for evaluating metrics.
arXiv Detail & Related papers (2024-06-11T13:01:50Z) - DELTA: Pre-train a Discriminative Encoder for Legal Case Retrieval via Structural Word Alignment [55.91429725404988]
We introduce DELTA, a discriminative model designed for legal case retrieval.
We leverage shallow decoders to create information bottlenecks, aiming to enhance the representation ability.
Our approach can outperform existing state-of-the-art methods in legal case retrieval.
arXiv Detail & Related papers (2024-03-27T10:40:14Z) - A Unifying Framework for Learning Argumentation Semantics [47.84663434179473]
We present a novel framework, which uses an Inductive Logic Programming approach to learn the acceptability semantics for several abstract and structured argumentation frameworks in an interpretable way. Our framework outperforms existing argumentation solvers, thus opening up new research directions in the area of formal argumentation and human-machine dialogues.
arXiv Detail & Related papers (2023-10-18T20:18:05Z) - Abductive Commonsense Reasoning Exploiting Mutually Exclusive Explanations [118.0818807474809]
Abductive reasoning aims to find plausible explanations for an event.
Existing approaches for abductive reasoning in natural language processing often rely on manually generated annotations for supervision.
This work proposes an approach for abductive commonsense reasoning that exploits the fact that only a subset of explanations is correct for a given context.
arXiv Detail & Related papers (2023-05-24T01:35:10Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Probing as Quantifying the Inductive Bias of Pre-trained Representations [99.93552997506438]
We present a novel framework for probing where the goal is to evaluate the inductive bias of representations for a particular task.
We apply our framework to a series of token-, arc-, and sentence-level tasks.
arXiv Detail & Related papers (2021-10-15T22:01:16Z)
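As a rough illustration of the ensemble-of-LLM-judges evaluation summarized above (Beyond Gold Standards), the following sketch averages per-criterion scores over several judges. It is not the authors' code: the `query_judge` helper, the judge list, and the equal-weight averaging over the LP/MC/FV/FQ criteria are assumptions made here for clarity.

```python
"""Minimal sketch of an ensemble-of-LLM-judges score for one autoformalization.

Assumes a hypothetical `query_judge` call that asks one LLM judge to rate a
formalization on one criterion with a score in [0, 1].
"""
from statistics import mean

# logical preservation, mathematical consistency, formal validity, formal quality
CRITERIA = ["LP", "MC", "FV", "FQ"]

def query_judge(judge: str, criterion: str, informal: str, formal: str) -> float:
    # Placeholder for an LLM API call; returns a score in [0, 1].
    raise NotImplementedError

def ensemble_score(informal: str, formal: str, judges: list[str]) -> dict[str, float]:
    # Average each criterion over all judges, then add an overall (equal-weight) score.
    scores = {
        c: mean(query_judge(j, c, informal, formal) for j in judges) for c in CRITERIA
    }
    scores["overall"] = mean(scores[c] for c in CRITERIA)
    return scores
```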
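Similarly, the type-check filtering with self-consistency described in Improving Autoformalization using Type Checking can be outlined roughly as below. This is a sketch under stated assumptions: `sample_formalizations` and `lean_type_checks` are hypothetical placeholders for an LLM sampling call and a Lean 4 compilation check, and the simple frequency vote stands in for the paper's actual self-consistency criterion.

```python
"""Minimal sketch of type-check filtering plus a self-consistency vote.

`sample_formalizations` and `lean_type_checks` are hypothetical placeholders.
"""
from collections import Counter

def sample_formalizations(informal_statement: str, n: int) -> list[str]:
    # Placeholder: query an LLM n times for candidate Lean 4 statements.
    raise NotImplementedError

def lean_type_checks(candidate: str) -> bool:
    # Placeholder: elaborate the candidate with Lean 4 / Mathlib and report success.
    raise NotImplementedError

def autoformalize(informal_statement: str, n_samples: int = 16) -> str | None:
    candidates = sample_formalizations(informal_statement, n_samples)
    # 1) Type-check filtering: keep only candidates that compile without errors.
    well_typed = [c for c in candidates if lean_type_checks(c)]
    if not well_typed:
        return None
    # 2) Self-consistency: prefer the formalization sampled most often,
    #    comparing candidates after trivial whitespace normalization.
    counts = Counter(" ".join(c.split()) for c in well_typed)
    best_norm, _ = counts.most_common(1)[0]
    return next(c for c in well_typed if " ".join(c.split()) == best_norm)
```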