Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
- URL: http://arxiv.org/abs/2508.08292v2
- Date: Wed, 27 Aug 2025 02:14:01 GMT
- Title: Putnam-AXIOM: A Functional and Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
- Authors: Aryan Gulati, Brando Miranda, Eric Chen, Emily Xia, Kai Fronsdal, Bruno Dumont, Elyas Obbad, Sanmi Koyejo
- Abstract summary: Current benchmarks for large language models (LLMs) are approaching saturation and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark drawn from the prestigious William Lowell Putnam Mathematical Competition. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed.
- Score: 19.592385109516268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving > 90% accuracy, and are increasingly compromised by training-set contamination. We introduce Putnam-AXIOM, a benchmark of 522 university-level competition problems drawn from the prestigious William Lowell Putnam Mathematical Competition, and Putnam-AXIOM Variation, an unseen companion set of 100 functional variants generated by programmatically perturbing variables and constants. The variation protocol produces an unlimited stream of equally difficult, unseen instances, yielding a contamination-resilient test bed. On the Original set, OpenAI's o1-preview, the strongest evaluated model, scores 41.9%, but its accuracy drops by 19.6 percentage points (a 46.8% relative decrease) on the paired Variations. The remaining eighteen models show the same downward trend, ten of them with non-overlapping 95% confidence intervals. These gaps suggest memorization and highlight the necessity of dynamic benchmarks. We complement "boxed" accuracy with Teacher-Forced Accuracy (TFA), a lightweight metric that directly scores reasoning traces and automates natural-language proof evaluation. Putnam-AXIOM therefore provides a rigorous, contamination-resilient evaluation framework for assessing advanced mathematical reasoning in LLMs. Data and evaluation code are publicly available at https://github.com/brando90/putnam-axiom.
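The two mechanisms named in the abstract lend themselves to a short illustration. The sketch below shows, first, a functional variation in miniature, where a constant and an exponent in a problem template are resampled and the ground-truth answer is recomputed per instance, and second, one plausible reading of Teacher-Forced Accuracy: feed the gold reasoning trace to the model and score its next-token predictions under teacher forcing. The toy problem, the function names, and the HuggingFace-style model/tokenizer interface are illustrative assumptions, not the released evaluation code.

```python
import random

import torch


def make_variation(seed: int) -> dict:
    """One functional variant: perturb constants, recompute the answer."""
    rng = random.Random(seed)
    a = rng.randint(2, 9)      # perturbed base
    n = rng.choice([3, 5, 7])  # perturbed exponent
    return {
        "problem": f"Find the last digit of {a}^{n}.",
        "answer": str(pow(a, n, 10)),  # answer recomputed per instance
    }


# An unlimited stream of equally structured, previously unseen instances.
variants = [make_variation(s) for s in range(100)]


def teacher_forced_accuracy(model, tokenizer, prompt: str, gold_trace: str) -> float:
    """Fraction of gold-trace tokens the model predicts correctly when the gold
    trace itself is fed as input (teacher forcing). Ignores tokenizer boundary
    effects at the prompt/trace seam, which a real metric would have to handle."""
    ids = tokenizer(prompt + gold_trace, return_tensors="pt").input_ids
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(ids).logits                   # (1, seq_len, vocab)
    preds = logits[0, n_prompt - 1 : -1].argmax(-1)  # predictions for trace positions
    targets = ids[0, n_prompt:]                      # the gold trace tokens
    return (preds == targets).float().mean().item()
```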
Related papers
- Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals [18.612081365101464]
We develop a framework that combines standard labeled outcomes with pairwise comparison signals obtained by having models judge auxiliary reasoning chains. Across simulations, our one-step estimator substantially improves ranking accuracy, with gains increasing as model output noise grows. Experiments on GPQA Diamond, AIME 2025 and GSM8K further demonstrate more precise performance estimation and more reliable model rankings.
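The summary does not spell out the estimator, but the general recipe of fusing labeled outcomes with pairwise judgments can be sketched. The toy below initializes Bradley-Terry skill scores from labeled accuracy, then refines them with a single Newton step on the pairwise-comparison likelihood; treating this as the paper's one-step estimator would be an assumption, not something the summary confirms.

```python
import numpy as np


def one_step_skills(acc, wins, counts):
    """acc[i]: labeled accuracy of model i; wins[i, j]: times i beat j in
    pairwise judging; counts[i, j]: number of i-vs-j comparisons."""
    theta = np.log(acc / (1 - acc))  # initial skills from labeled outcomes
    # Bradley-Terry win probabilities implied by the current skills.
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - theta[None, :])))
    grad = (wins - counts * p).sum(axis=1)        # score of the BT log-likelihood
    hess = (counts * p * (1 - p)).sum(axis=1)     # diagonal Fisher information
    return theta + grad / np.maximum(hess, 1e-9)  # one Newton refinement step


acc = np.array([0.62, 0.58, 0.55])
wins = np.array([[0, 14, 16], [6, 0, 12], [4, 8, 0]])
counts = np.array([[0, 20, 20], [20, 0, 20], [20, 20, 0]])
print(one_step_skills(acc, wins, counts))  # refined skills give a steadier ranking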
arXiv Detail & Related papers (2026-02-03T03:40:01Z)
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study this gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on the two settings (SG and CS), while proprietary models drop by 13 and 20 points.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
- MSC-180: A Benchmark for Automated Formal Theorem Proving from Mathematical Subject Classification [21.9173105378467]
Current large language model (LLM)-based theorem provers suffer from limitations such as restricted domain coverage and weak generalization in mathematical reasoning. We propose MSC-180, a benchmark for evaluation based on the MSC 2020 mathematical subject classification. It comprises 180 formal verification problems, three advanced problems from each of 60 mathematical branches, spanning from undergraduate to graduate levels.
arXiv Detail & Related papers (2025-12-20T07:39:19Z)
- Relative Scaling Laws for LLMs [91.73497548097775]
Scaling laws describe how language models improve with additional data, parameters, and compute. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale. These results show that although scaling improves overall performance, it is not a universal equalizer.
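A minimal sketch of the idea, assuming a saturating power law L(C) = a·C^(-b) + e for each test distribution: fit the law per distribution, then track the projected gap as compute grows. All data points and fitted values below are illustrative, not the paper's.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(c, a, b, e):
    return a * c ** (-b) + e  # assumed form: loss saturates at e as compute grows


compute = np.array([1.0, 10.0, 100.0, 1e3, 1e4])  # compute in units of 1e18 FLOPs
loss_main = np.array([3.1, 2.6, 2.2, 1.9, 1.7])   # illustrative in-distribution loss
loss_rare = np.array([4.0, 3.6, 3.3, 3.1, 3.0])   # illustrative rare-domain loss

p_main, _ = curve_fit(power_law, compute, loss_main, p0=[2.0, 0.2, 1.0], maxfev=20000)
p_rare, _ = curve_fit(power_law, compute, loss_rare, p0=[2.0, 0.2, 1.0], maxfev=20000)

# The "relative" law: does scaling close the gap between distributions, or not?
for c in (1e5, 1e6):
    gap = power_law(c, *p_rare) - power_law(c, *p_main)
    print(f"compute={c:.0e} (x1e18 FLOPs): projected gap = {gap:.2f}")
```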
arXiv Detail & Related papers (2025-10-28T16:55:22Z)
- WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning [51.13280433665446]
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning.
arXiv Detail & Related papers (2025-09-27T09:58:03Z)
- VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks [25.295071827427677]
Benchmark contamination arises from the public availability of test problems, while evaluation fragility stems from the reliance on single-instance assessments. VAR-MATH is a symbolic evaluation framework designed to probe genuine reasoning ability.
arXiv Detail & Related papers (2025-07-17T08:10:55Z)
- Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields. This makes it a demanding frontier for large language models (LLMs). We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
- LookAlike: Consistent Distractor Generation in Math MCQs [42.19039301965107]
We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning. LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation.
arXiv Detail & Related papers (2025-05-03T19:18:06Z)
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
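The summary above describes sampling from a reward-tilted distribution pi(y) ∝ p_base(y)·exp(r(y)/beta) without touching model weights or logits. A toy Metropolis-Hastings sketch of that idea follows; the uniform "base model", the counting reward, and the reward-only acceptance rule (which assumes the base-model terms cancel for suffix proposals drawn from the base model itself) are all simplifying assumptions, not QAlign's exact algorithm.

```python
import math
import random

rng = random.Random(0)
VOCAB = "abcde"
LENGTH = 8


def base_sample(prefix: str) -> str:
    """Stand-in for sampling a completion from the base LM given a prefix."""
    return prefix + "".join(rng.choice(VOCAB) for _ in range(LENGTH - len(prefix)))


def reward(y: str) -> float:
    return y.count("a")  # stand-in for a learned reward model


def tilted_mcmc(steps: int = 2000, beta: float = 0.5) -> str:
    y = base_sample("")
    for _ in range(steps):
        i = rng.randrange(LENGTH)   # pick a cut point
        y_new = base_sample(y[:i])  # regenerate the suffix from the base model
        # Reward-only acceptance: assumes base-model proposal terms cancel.
        if math.log(rng.random()) < (reward(y_new) - reward(y)) / beta:
            y = y_new
    return y


print(tilted_mcmc())  # a sample tilted toward high reward, no fine-tuning needed
```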
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
- Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, achieving only 60.54% and 52.55% accuracy respectively, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to construct in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z)
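The invariance measure described in this last entry is simple enough to sketch end to end: perturb a test point within a small neighborhood and report the fraction of neighbors whose predicted label matches the original, with no true label required. The linear classifier and Gaussian neighborhood below are toy placeholders, not the paper's setup.

```python
import numpy as np


def neighborhood_invariance(predict, x, perturb, n=32):
    """Fraction of perturbed copies of x that keep the classifier's prediction.
    Needs no ground-truth label, so it also applies out of domain."""
    base = predict(x)
    return float(np.mean([predict(perturb(x)) == base for _ in range(n)]))


# Toy usage: a linear classifier on 2-D points, Gaussian noise as the neighborhood.
rng = np.random.default_rng(0)
w = np.array([1.0, -1.0])
predict = lambda x: int(x @ w > 0)
perturb = lambda x: x + rng.normal(scale=0.1, size=x.shape)

print(neighborhood_invariance(predict, np.array([2.0, -2.0]), perturb))  # ~1.0, far from boundary
print(neighborhood_invariance(predict, np.array([0.05, 0.0]), perturb))  # lower, near the boundary
```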
This list is automatically generated from the titles and abstracts of the papers on this site.