INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving
- URL: http://arxiv.org/abs/2007.02924v2
- Date: Sat, 3 Apr 2021 16:21:59 GMT
- Title: INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving
- Authors: Yuhuai Wu, Albert Qiaochu Jiang, Jimmy Ba, Roger Grosse
- Abstract summary: In this paper, we introduce an INequality Theorem proving benchmark, specifically designed to test agents' generalization ability.
INT is based on a procedure for generating theorems and proofs; this procedure's knobs allow us to measure 6 different types of generalization.
We evaluate learning-based baselines across these dimensions, then augment the same agents with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help to prove new theorems.
- Score: 36.194330645092634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In learning-assisted theorem proving, one of the most critical challenges is
to generalize to theorems unlike those seen at training time. In this paper, we
introduce INT, an INequality Theorem proving benchmark, specifically designed
to test agents' generalization ability. INT is based on a procedure for
generating theorems and proofs; this procedure's knobs allow us to measure 6
different types of generalization, each reflecting a distinct challenge
characteristic of automated theorem proving. In addition, unlike prior
benchmarks for learning-assisted theorem proving, INT provides a lightweight
and user-friendly theorem proving environment with fast simulations, conducive
to performing learning-based and search-based research. We introduce
learning-based baselines and evaluate them across 6 dimensions of
generalization with the benchmark. We then evaluate the same agents augmented
with Monte Carlo Tree Search (MCTS) at test time, and show that MCTS can help
to prove new theorems.
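The abstract describes augmenting trained agents with Monte Carlo Tree Search at test time. The sketch below illustrates the general MCTS loop (UCB1 selection, expansion, random rollout, backpropagation) on a toy proof environment. The environment, its state encoding, and the action names here are illustrative stand-ins, not INT's actual API: a state is a tuple of numeric "obligations", `split` decomposes an obligation, and `close` discharges a trivial one.

```python
import math
import random

# Toy proof environment (illustrative only, not the INT benchmark's API).
# A state is a tuple of remaining proof obligations, each a positive integer.
ACTIONS = ("close", "split")

def step(state, action):
    """Apply an action to the last obligation; return the successor state."""
    obligations = list(state)
    if not obligations:
        return state
    goal = obligations.pop()
    if action == "close" and goal <= 1:
        pass  # trivial obligation discharged
    elif action == "split" and goal > 1:
        obligations += [goal // 2, goal - goal // 2]  # decompose the goal
    else:
        obligations.append(goal)  # action inapplicable: no progress
    return tuple(obligations)

def is_proved(state):
    return len(state) == 0

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0

def ucb(child, parent_visits, c=1.4):
    """UCB1 score balancing exploitation and exploration."""
    if child.visits == 0:
        return float("inf")
    return (child.value / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def rollout(state, max_depth=20):
    """Random playout; reward 1.0 if the proof closes within the budget."""
    for _ in range(max_depth):
        if is_proved(state):
            return 1.0
        state = step(state, random.choice(ACTIONS))
    return 1.0 if is_proved(state) else 0.0

def mcts(root_state, iterations=200):
    """Run MCTS from root_state; return the most-visited first action."""
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: descend via UCB1 while all actions are expanded.
        while len(node.children) == len(ACTIONS) and not is_proved(node.state):
            node = max(node.children.values(),
                       key=lambda ch: ucb(ch, node.visits))
        # Expansion: add one untried child.
        if not is_proved(node.state):
            action = random.choice(
                [a for a in ACTIONS if a not in node.children])
            child = Node(step(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # Simulation and backpropagation.
        reward = rollout(node.state)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```

In a real prover the rollout policy and action priors would come from the trained agent rather than uniform random choice; the tree structure and backup rule are the same.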
Related papers
- Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.
For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.
As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z)
- Lean-STaR: Learning to Interleave Thinking and Proving [53.923617816215774]
We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof.
Lean-STaR achieves state-of-the-art results on the miniF2F-test benchmark within the Lean theorem proving environment.
arXiv Detail & Related papers (2024-07-14T01:43:07Z)
- Proving Theorems Recursively [80.42431358105482]
We propose POETRY, which proves theorems in a level-by-level manner.
Unlike previous step-by-step methods, POETRY searches for a sketch of the proof at each level.
We observe a substantial increase in the maximum proof length found by POETRY, from 10 to 26.
arXiv Detail & Related papers (2024-05-23T10:35:08Z)
- ATG: Benchmarking Automated Theorem Generation for Generative Language Models [83.93978859348313]
Humans can develop new theorems to explore broader and more complex mathematical results.
Current generative language models (LMs) have achieved significant improvement in automatically proving theorems.
This paper proposes an Automated Theorem Generation benchmark that evaluates whether an agent can automatically generate valuable (and possibly brand new) theorems.
arXiv Detail & Related papers (2024-05-05T02:06:37Z)
- Representer Theorems for Metric and Preference Learning: A Geometric Perspective [0.0]
We obtain a novel representer theorem for the simultaneous task of metric and preference learning.
We show that our framework can be applied to the task of metric learning from triplet comparisons.
In the case of Reproducing Kernel Hilbert Spaces, we demonstrate that the solution to the learning problem can be expressed using kernel terms.
arXiv Detail & Related papers (2023-04-07T16:34:25Z)
- Learning to Guide a Saturation-Based Theorem Prover [9.228237801323042]
TRAIL is a deep learning-based approach to theorem proving that characterizes core elements of saturation-based theorem proving within a neural framework.
To the best of our knowledge, TRAIL is the first reinforcement learning-based approach to exceed the performance of a state-of-the-art traditional theorem prover.
arXiv Detail & Related papers (2021-06-07T18:35:57Z)
- Learning to Prove Theorems by Learning to Generate Theorems [71.46963489866596]
We learn a neural generator that automatically synthesizes theorems and proofs for the purpose of training a theorem prover.
Experiments on real-world tasks demonstrate that synthetic data from our approach improves the theorem prover.
arXiv Detail & Related papers (2020-02-17T16:06:02Z)
- An Experimental Study of Formula Embeddings for Automated Theorem Proving in First-Order Logic [13.633012068936894]
We study and experimentally compare pattern-based embeddings that are applied in current systems with popular graph-based encodings.
Our experiments show that the runtime advantages of simpler encoding schemes are outdone by more complex graph-based embeddings.
arXiv Detail & Related papers (2020-02-02T16:07:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.