Related papers: RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

URL: http://arxiv.org/abs/2505.12575v1
Date: Sun, 18 May 2025 23:32:46 GMT
Title: RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics
Authors: Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr,
Abstract summary: Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions.<n>We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks.
Score: 21.453837660747844
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics encountered in actual research environments. We introduce RealMath, a novel benchmark derived directly from research papers and mathematical forums that assesses LLMs' abilities on authentic mathematical tasks. Our approach addresses three critical challenges: sourcing diverse research-level content, enabling reliable automated evaluation through verifiable statements, and designing a continually refreshable dataset to mitigate contamination risks. Experimental results across multiple LLMs reveal surprising capabilities in handling research mathematics compared to competition problems, suggesting current models may already serve as valuable assistants for working mathematicians despite limitations on highly challenging problems. The code and dataset for RealMath are publicly available.

Related papers

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics [5.676144562388248]
We present a new approach for benchmarking Large Language Model capabilities on research-level mathematics.<n>Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for mathematical research.<n>Instead, we establish an updatable benchmark evaluating models directly on the latest research results in mathematics.
arXiv Detail & Related papers (2026-02-27T16:52:52Z)
IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation [4.991157581428135]
IMProofBench is a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians.<n>Each problem requires a detailed proof and is paired with subproblems that have final answers.<n>Unlike prior benchmarks, the evaluation setup simulates a realistic research environment.
arXiv Detail & Related papers (2025-09-30T10:50:37Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
Large Language Models for Mathematical Analysis [3.7325315394927023]
This work addresses critical gaps in mathematical reasoning and contributes to advancing trustworthy AI.<n>We developed the DEMI-MathAnalysis dataset, comprising proof-based problems from mathematical analysis topics.<n>We also designed a guiding framework to rigorously enhance LLMs' ability to solve these problems.
arXiv Detail & Related papers (2024-12-28T20:37:55Z)
Data for Mathematical Copilots: Better Ways of Presenting Proofs for Machine Learning [85.635988711588]
We argue that enhancing the capabilities of large language models requires a paradigm shift in the design of mathematical datasets.<n>We advocate for mathematical dataset developers to consider the concept of "motivated proof", introduced by G. P'olya in 1949, which can serve as a blueprint for datasets that offer a better proof learning signal.<n>We provide a questionnaire designed specifically for math datasets that we urge creators to include with their datasets.
arXiv Detail & Related papers (2024-12-19T18:55:17Z)
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques.<n>Our framework auto-generates a large number of problems with solutions validated against numerical ground truths.<n>We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z)
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection. ErrorRadar evaluates two sub-tasks: error step identification and error categorization. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data [20.31528845718877]
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset.
arXiv Detail & Related papers (2024-06-26T13:02:35Z)
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks [34.09857430966818]
We introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. We conduct fine-tuning experiments with three prominent large language models: LLaMA-2, WizardMath, and MAmmoTH. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems.
arXiv Detail & Related papers (2024-04-19T08:45:42Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.