Related papers: Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

Max It or Miss It: Benchmarking LLM On Solving Extremal Problems

URL: http://arxiv.org/abs/2510.12997v2
Date: Fri, 17 Oct 2025 21:17:43 GMT
Title: Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
Authors: Binxin Gao, Jingjun Han,
Abstract summary: We introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems.<n>We conduct evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek.<n>Results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time scaling has enabled Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e. finding extrema under constraints, represents a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including the Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with those of current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study gap through contextual mathematical reasoning.<n>We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings.<n>Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery [23.517907682810932]
We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning.<n>The pipeline transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks.<n>Applying this pipeline yields textbfEternalMath, an evolving evaluation suite derived from contemporary research papers.
arXiv Detail & Related papers (2026-01-04T06:40:25Z)
IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation [4.991157581428135]
IMProofBench is a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians.<n>Each problem requires a detailed proof and is paired with subproblems that have final answers.<n>Unlike prior benchmarks, the evaluation setup simulates a realistic research environment.
arXiv Detail & Related papers (2025-09-30T10:50:37Z)
Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning [0.0]
Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning.<n>We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
arXiv Detail & Related papers (2025-05-21T15:12:20Z)
Computational Reasoning of Large Language Models [51.629694188014064]
We introduce textbfTuring Machine Bench, a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes.<n> TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [4.573289946657861]
We evaluate reasoning models on six problems from the 2025 USAMO.<n>Only Gemini-2.5-Pro achieves a non-trivial score of 25%.<n>Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.<n>The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.<n>Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.<n>We rigorously analyze both final answers and solution steps to identify reasoning failures.<n>We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection. ErrorRadar evaluates two sub-tasks: error step identification and error categorization. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [46.670206614087334]
We argue that if a model really understands a problem, it should be robustly applied across a diverse array of tasks. MathCheck is a well-designed checklist for testing task generalization and reasoning. MathCheck better reflects true mathematical abilities and represents mathematical intelligence more linearly.
arXiv Detail & Related papers (2024-07-11T17:58:58Z)
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.<n>We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.<n>We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation. We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets. We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.