EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery
- URL: http://arxiv.org/abs/2601.01400v1
- Date: Sun, 04 Jan 2026 06:40:25 GMT
- Title: EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery
- Authors: Jicheng Ma, Guohua Wang, Xinhua Feng, Yiming Liu, Zhichao Hu, Yuhong Liu,
- Abstract summary: We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning. The pipeline transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers.
- Score: 23.517907682810932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current evaluations of mathematical reasoning in large language models (LLMs) are dominated by static benchmarks, either derived from competition-style problems or curated through costly expert effort, resulting in limited coverage of research-level mathematics and rapid performance saturation. We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning, which directly transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. The pipeline identifies constructive or quantitative results, instantiates them into parameterized problem templates, and generates deterministic solutions through execution-based verification, enabling scalable, reproducible, and continuously updatable evaluation without reliance on large-scale expert authoring. By design, this approach supports temporal extensibility, intrinsic correctness checking, and domain-specific customization across mathematical subfields. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers. Experiments with state-of-the-art LLMs reveal substantial performance gaps, indicating that mathematical reasoning at the research frontier remains far from saturated and underscoring the need for evaluation methodologies that evolve in step with human mathematical discovery.
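The abstract describes the template-and-verification step only at a high level. The Python sketch below is an illustrative guess at what one such task could look like: a problem template with a natural-language statement, a parameter sampler, and an executable reference solver, plus an execution-based grader. All names (`ProblemTemplate`, `instantiate`, `grade`) and the toy divisor-counting quantity are hypothetical and not taken from the paper.

```python
# Minimal, hypothetical sketch of an EternalMath-style task (names and the example
# quantity are illustrative; this is not the authors' implementation).
import random
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class ProblemTemplate:
    statement: str                                   # prompt with {placeholders}
    sample_params: Callable[[random.Random], Dict]   # draws one concrete instance
    solve: Callable[[Dict], int]                     # deterministic, executable reference

    def instantiate(self, seed: int) -> Tuple[str, int]:
        """Turn the template into a concrete (prompt, reference answer) pair."""
        rng = random.Random(seed)
        params = self.sample_params(rng)
        return self.statement.format(**params), self.solve(params)


def count_divisors(params: Dict) -> int:
    """Toy quantitative result standing in for a theorem-derived quantity."""
    n = params["n"]
    return sum(1 for d in range(1, n + 1) if n % d == 0)


template = ProblemTemplate(
    statement="How many positive divisors does {n} have?",
    sample_params=lambda rng: {"n": rng.randint(10**3, 10**4)},
    solve=count_divisors,
)

prompt, reference = template.instantiate(seed=42)


def grade(model_answer: str) -> bool:
    """Execution-based check: compare the model's final answer to the executed reference."""
    try:
        return int(model_answer.strip()) == reference
    except ValueError:
        return False
```

In the paper's pipeline the reference solver would encode a constructive or quantitative result extracted from a recent publication; here a simple number-theoretic quantity stands in so the sketch stays self-contained.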
Related papers
- LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics [5.676144562388248]
We present a new approach for benchmarking Large Language Model capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research. Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics.
arXiv Detail & Related papers (2026-02-27T16:52:52Z)
- Max It or Miss It: Benchmarking LLM On Solving Extremal Problems [0.0]
We introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems. We conduct evaluations across various state-of-the-art open-source model families, including Qwen3, GPT-OSS, and DeepSeek. Results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those measured by current mathematical benchmarks.
arXiv Detail & Related papers (2025-10-14T21:23:37Z)
- PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning [57.868248683256574]
PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
arXiv Detail & Related papers (2025-10-03T17:09:03Z)
- IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation [4.991157581428135]
IMProofBench is a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers. Unlike prior benchmarks, the evaluation setup simulates a realistic research environment.
arXiv Detail & Related papers (2025-09-30T10:50:37Z)
- An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems [48.10132234701036]
We introduce a systematic framework to assess LLMs' mathematical-reasoning robustness. We stress-test them on advanced math problems that are mathematically equivalent but vary in wording and parameters. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset.
arXiv Detail & Related papers (2025-08-12T10:40:33Z)
- RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics [30.778394290919582]
Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks.
arXiv Detail & Related papers (2025-05-18T23:32:46Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model. We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z)
- Token-Supervised Value Models for Enhancing Mathematical Problem-Solving Capabilities of Large Language Models [56.32800938317095]
Existing verifiers are sub-optimal for tree search techniques at test time. We propose token-supervised value models (TVMs), which assign each token a probability that reflects the likelihood of reaching the correct final answer.
arXiv Detail & Related papers (2024-07-12T13:16:50Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods on the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.