IsarStep: a Benchmark for High-level Mathematical Reasoning
- URL: http://arxiv.org/abs/2006.09265v2
- Date: Wed, 24 Mar 2021 16:45:18 GMT
- Title: IsarStep: a Benchmark for High-level Mathematical Reasoning
- Authors: Wenda Li and Lei Yu and Yuhuai Wu and Lawrence C. Paulson
- Abstract summary: We present a benchmark for high-level mathematical reasoning and study the reasoning capabilities of neural sequence-to-sequence models.
We build a non-synthetic dataset from the largest repository of proofs written by human experts in a theorem prover.
- Score: 20.96474618260995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A well-defined benchmark is essential for measuring and accelerating research
progress of machine learning models. In this paper, we present a benchmark for
high-level mathematical reasoning and study the reasoning capabilities of
neural sequence-to-sequence models. We build a non-synthetic dataset from the
largest repository of proofs written by human experts in a theorem prover. The
dataset has a broad coverage of undergraduate and research-level mathematical
and computer science theorems. In our defined task, a model is required to fill
in a missing intermediate proposition given surrounding proofs. This task
provides a starting point for the long-term goal of having machines generate
human-readable proofs automatically. Our experiments and analysis reveal that
while the task is challenging, neural models can capture non-trivial
mathematical reasoning. We further design a hierarchical transformer that
outperforms the transformer baseline.
Related papers
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion.
We develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
arXiv Detail & Related papers (2025-02-17T11:22:24Z) - Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning [85.635988711588]
We argue that enhancing the capabilities of large language models requires a paradigm shift in the design of mathematical datasets.
We advocate for mathematical dataset developers to consider the concept of "motivated proof", introduced by G. P'olya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal.
We provide a questionnaire designed specifically for math datasets that we urge creators to include with their datasets.
arXiv Detail & Related papers (2024-12-19T18:55:17Z) - Benchmarking Large Language Models with Integer Sequence Generation Tasks [1.3108652488669736]
This paper presents a novel benchmark where the large language model (LLM) must write code that computes integer sequences from the Online Encyclopedia of Sequences (OEIS)
Our benchmark reveals that the o1 series of models outperform other frontier models from OpenAI, Anthropic, Meta, and Google in accuracy and cheating rates across both easy and hard integer sequences.
arXiv Detail & Related papers (2024-11-07T02:05:43Z) - Discovering symbolic expressions with parallelized tree search [59.92040079807524]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade when handling problems of complexity.
We introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative
Language Models [68.65075559137608]
We propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas.
We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system.
We develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability.
arXiv Detail & Related papers (2023-10-16T08:42:39Z) - A Survey of Deep Learning for Mathematical Reasoning [71.88150173381153]
We review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade.
Recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning.
arXiv Detail & Related papers (2022-12-20T18:46:16Z) - Generative Language Modeling for Automated Theorem Proving [94.01137612934842]
This work is motivated by the possibility that a major limitation of automated theorem provers compared to humans might be addressable via generation from language models.
We present an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyze its performance.
arXiv Detail & Related papers (2020-09-07T19:50:10Z) - Learning to Prove from Synthetic Theorems [41.74768503409581]
A major challenge in applying machine learning to automated theorem proving is the scarcity of training data.
We propose an approach that relies on training with synthetic theorems, generated from a set of axioms.
We show that such theorems can be used to train an automated prover and that the learned prover transfers successfully to human-generated theorems.
arXiv Detail & Related papers (2020-06-19T17:48:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.