Related papers: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

URL: http://arxiv.org/abs/2411.04872v4
Date: Fri, 22 Nov 2024 18:00:00 GMT
Title: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
Authors: Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, Mark Wildon,
Abstract summary: FrontierMath is a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.
Score: 2.0608396919601493
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

Related papers

Proof2Hybrid: Automatic Mathematical Benchmark Synthesis for Proof-Centric Problems [9.041749463376599]
We propose Proof2Hybrid, a framework that synthesizes high-quality, proof-centric benchmarks from natural language mathematical corpora.<n>Our framework and benchmark pave the way for a new wave of in-depth research into the mathematical intelligence of AI systems.
arXiv Detail & Related papers (2025-08-04T08:59:36Z)
RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics [21.453837660747844]
Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions.<n>We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks.
arXiv Detail & Related papers (2025-05-18T23:32:46Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
Formal Mathematical Reasoning: A New Frontier in AI [60.26950681543385]
We advocate for formal mathematical reasoning and argue that it is indispensable for advancing AI4Math to the next level. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
arXiv Detail & Related papers (2024-12-20T17:19:24Z)
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks [34.09857430966818]
We introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. We conduct fine-tuning experiments with three prominent large language models: LLaMA-2, WizardMath, and MAmmoTH. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems.
arXiv Detail & Related papers (2024-04-19T08:45:42Z)
FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models [44.63505885248145]
FineMath is a fine-grained mathematical evaluation benchmark dataset for assessing Chinese Large Language Models (LLMs) FineMath is created to cover the major key mathematical concepts taught in elementary school math, which are divided into 17 categories of math word problems. All the 17 categories of math word problems are manually annotated with their difficulty levels according to the number of reasoning steps required to solve these problems.
arXiv Detail & Related papers (2024-03-12T15:32:39Z)
Machine learning and information theory concepts towards an AI Mathematician [77.63761356203105]
The current state-of-the-art in artificial intelligence is impressive, especially in terms of mastery of language, but not so much in terms of mathematical reasoning. This essay builds on the idea that current deep learning mostly succeeds at system 1 abilities. It takes an information-theoretical posture to ask questions about what constitutes an interesting mathematical statement.
arXiv Detail & Related papers (2024-03-07T15:12:06Z)
MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving. However, their proficiency in solving mathematical problems remains inadequate. We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics [0.0]
Beyond human-AI chat, large language models (LLMs) are emerging in programming, algorithm discovery, and theorem proving. This project introduces Math Agents and mathematical embedding as fresh entries to the "Moore's Law of Mathematics" Project aims to use Math Agents and mathematical embeddings to address the ageing issue in information systems biology.
arXiv Detail & Related papers (2023-07-04T20:16:32Z)
A Survey of Deep Learning for Mathematical Reasoning [71.88150173381153]
We review the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade. Recent advances in large-scale neural language models have opened up new benchmarks and opportunities to use deep learning for mathematical reasoning.
arXiv Detail & Related papers (2022-12-20T18:46:16Z)
JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model(PLM) Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement. We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
arXiv Detail & Related papers (2022-06-13T17:03:52Z)
A Neural Network Solves and Generates Mathematics Problems by Program Synthesis: Calculus, Differential Equations, Linear Algebra, and More [8.437319139670116]
We turn questions into programming tasks, automatically generate programs, and then execute them. This is the first work to automatically solve, grade, and generate university-level Mathematics course questions at scale.
arXiv Detail & Related papers (2021-12-31T18:57:31Z)
Measuring Mathematical Problem Solving With the MATH Dataset [55.4376028963537]
We introduce MATH, a dataset of 12,500 challenging competition mathematics problems. Each problem has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. We also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
arXiv Detail & Related papers (2021-03-05T18:59:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.