LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
- URL: http://arxiv.org/abs/2602.24173v1
- Date: Fri, 27 Feb 2026 16:52:52 GMT
- Title: LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics
- Authors: Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat,
- Abstract summary: We present a new approach for benchmarking Large Language Model capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics.
- Score: 5.676144562388248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark that evaluates models directly on the latest research results in mathematics. It consists of an automatic pipeline that extracts lemmas from arXiv and rewrites them into self-contained statements by making all assumptions and required definitions explicit. The result is a benchmark that can be updated regularly with new problems taken directly from human mathematical research, while previous instances can be used for training without compromising future evaluations. We benchmark current state-of-the-art LLMs, which obtain around 10-15% accuracy in theorem proving (pass@1) depending on the model, showing that LLMs still have substantial room for improvement before reaching human-level proving capabilities in a research context.
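The abstract reports pass@1 accuracy. The paper excerpt does not spell out its estimator, but pass@k is conventionally computed with the unbiased estimator of Chen et al. (2021); a minimal sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn without replacement from n attempts is correct, given
    that c of the n attempts were correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains a correct attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 20 sampled proofs for a lemma, 3 judged correct.
print(pass_at_k(n=20, c=3, k=1))  # 0.15, i.e. 15% pass@1
```

For k=1 this reduces to the fraction of correct attempts, which matches the 10-15% figures quoted above.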
Related papers
- EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery [23.517907682810932]
We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning. The pipeline transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers.
arXiv Detail & Related papers (2026-01-04T06:40:25Z)
- MathArena: Evaluating LLMs on Uncontaminated Math Competitions [4.655668424508813]
MathArena is a new benchmark for evaluating large language models (LLMs). It is based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems. MathArena is also the first benchmark for proof-writing capabilities.
arXiv Detail & Related papers (2025-05-29T09:28:06Z)
- RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics [30.778394290919582]
Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks.
arXiv Detail & Related papers (2025-05-18T23:32:46Z)
- Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the LLM era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
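QAlign's exact algorithm is not given in this blurb; the sketch below is a generic illustration of the underlying idea of MCMC-based test-time alignment: Metropolis-Hastings with fresh base-model draws as the proposal, targeting p(x) proportional to p_base(x) * exp(reward(x)/beta). The names `sample_base` and `reward` are hypothetical placeholders, not the authors' API.

```python
import math
import random

def mcmc_align(sample_base, reward, steps: int = 200, beta: float = 1.0):
    """Metropolis-Hastings over whole generations.
    Target: p(x) proportional to p_base(x) * exp(reward(x) / beta).
    With independent base-model draws as the proposal, the p_base
    terms cancel, so acceptance depends only on the reward difference."""
    x = sample_base()
    rx = reward(x)
    for _ in range(steps):
        y = sample_base()              # propose an independent draw
        ry = reward(y)
        accept = math.exp(min(0.0, (ry - rx) / beta))
        if random.random() < accept:   # accept w.p. min(1, exp(dr/beta))
            x, rx = y, ry
    return x

# Toy check: the "model" emits digits, the reward prefers large ones.
print(mcmc_align(lambda: random.randint(0, 9), float, steps=300))
```

Note how this matches the blurb's claim: the acceptance ratio needs only reward evaluations on sampled texts, not model logits.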
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- LeanAgent: Lifelong Learning for Formal Theorem Proving [85.39415834798385]
We present LeanAgent, a novel lifelong learning framework for formal theorem proving. LeanAgent continuously generalizes to and improves on ever-expanding mathematical knowledge. It generates formal proofs for 155 theorems across 23 diverse Lean repositories.
arXiv Detail & Related papers (2024-10-08T17:11:24Z)
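For context, "formal proofs in Lean" means machine-checked artifacts like the two toy Lean 4 examples below; these are illustrative only and are not drawn from LeanAgent's repositories.

```lean
-- Tactic-mode proof: the omega decision procedure for linear
-- arithmetic over Nat closes the goal automatically.
theorem double (n : Nat) : n + n = 2 * n := by
  omega

-- Term-mode proof: directly reuse a library lemma.
theorem addComm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```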
- MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
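LM-Polygraph's own API is not reproduced here; as a flavor of what the simplest UQ baselines look like, the sketch below computes two standard quantities from decoding statistics (function names are my own, not the library's).

```python
import math

def mean_logprob(token_logprobs: list[float]) -> float:
    """Length-normalized sequence confidence: average token log-prob.
    Values closer to 0 indicate a more confident generation."""
    return sum(token_logprobs) / len(token_logprobs)

def predictive_entropy(step_distributions: list[list[float]]) -> float:
    """Average per-step entropy of the next-token distributions.
    Each inner list holds the probabilities at one decoding step."""
    total = 0.0
    for probs in step_distributions:
        total += -sum(p * math.log(p) for p in probs if p > 0)
    return total / len(step_distributions)

# Toy example: three confident tokens, then two decoding steps.
print(mean_logprob([-0.1, -0.3, -0.2]))              # about -0.2
print(predictive_entropy([[0.9, 0.1], [0.5, 0.5]]))  # mixed certainty
```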
- Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models [34.814875040792344]
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning capabilities of large language models (LLMs). Mathador-LM is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level.
arXiv Detail & Related papers (2024-06-18T13:02:12Z)
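Mathador-LM's actual instance generator is not shown here, but the game itself is easy to pin down: combine the base numbers with +, -, *, / to hit the target. A brute-force reference solver, as a sketch rather than the authors' code:

```python
from fractions import Fraction

def solve(numbers, target):
    """Search for an expression over the base numbers that reaches the
    target, repeatedly replacing two values with the result of one
    arithmetic operation (exact rational arithmetic via Fraction)."""
    target = Fraction(target)

    def search(items):  # items: list of (value, expression) pairs
        for value, expr in items:
            if value == target:
                return expr
        for i in range(len(items)):
            for j in range(len(items)):
                if i == j:
                    continue
                (a, ea), (b, eb) = items[i], items[j]
                rest = [items[k] for k in range(len(items)) if k not in (i, j)]
                candidates = [(a + b, f"({ea}+{eb})"),
                              (a - b, f"({ea}-{eb})"),
                              (a * b, f"({ea}*{eb})")]
                if b != 0:
                    candidates.append((a / b, f"({ea}/{eb})"))
                for value, expr in candidates:
                    found = search(rest + [(value, expr)])
                    if found:
                        return found
        return None

    return search([(Fraction(n), str(n)) for n in numbers])

print(solve([3, 4, 5, 6], 24))  # e.g. (4*6)
```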
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
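GSM-Plus builds its variations from several perturbation types; the sketch below illustrates just one of them, numeric substitution, to make the robustness-testing idea concrete (the function name and the checking harness are hypothetical, not the benchmark's code).

```python
import random
import re

def perturb_numbers(question: str, max_shift: int = 3) -> str:
    """Return a variant of a math word problem with each integer nudged,
    one simple instance of a GSM-Plus-style question variation."""
    def nudge(match: re.Match) -> str:
        return str(int(match.group()) + random.randint(1, max_shift))
    return re.sub(r"\d+", nudge, question)

original = "Ali had 12 apples and bought 7 more. How many apples does he have?"
print(perturb_numbers(original))
# A robustness harness would recompute the ground truth for the
# perturbed numbers (here: a + b) and re-query the model on the variant.
```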