Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
- URL: http://arxiv.org/abs/2510.12997v2
- Date: Fri, 17 Oct 2025 21:17:43 GMT
- Title: Max It or Miss It: Benchmarking LLM On Solving Extremal Problems
- Authors: Binxin Gao, Jingjun Han
- Abstract summary: We introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems. We conduct evaluations across various state-of-the-art open-source model families, including Qwen3, GPT-OSS, and DeepSeek. Results reveal that LLMs' extremal-solving reasoning capabilities do not always align with their performance on current mathematical benchmarks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Test-time scaling has endowed Large Language Models (LLMs) with remarkable reasoning capabilities, particularly in mathematical domains, through intermediate chain-of-thought (CoT) reasoning before generating final answers. However, the specific sources and mechanisms underlying these reasoning capabilities remain insufficiently understood. Optimization reasoning, i.e., finding extrema under constraints, is a fundamental abstraction that underpins critical applications in planning, control, resource allocation, and prompt search. To systematically evaluate this capability, we introduce ExtremBench, a benchmark dataset for solving mathematical extremal problems, curated from inequality exercises used for the Chinese Mathematical Olympiad and transformed into $93$ standardized extrema-finding problems. We conduct extensive evaluations across various state-of-the-art open-source model families, including Qwen3, GPT-OSS, and DeepSeek. Our results reveal that LLMs' extremal-solving reasoning capabilities do not always align with their performance on current mathematical benchmarks such as AIME25 and MATH-500, with some models showing strong general mathematical reasoning but poor extremal-solving skills, and vice versa. This discrepancy highlights a critical gap in current evaluation practices and suggests that existing benchmarks may not comprehensively capture the full spectrum of mathematical reasoning abilities.
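To make the inequality-to-extremum transformation in the abstract concrete, here is a minimal sketch. The problem is a textbook example chosen purely for illustration and is not taken from ExtremBench: the inequality "for positive a, b with a + b = 1, show 1/a + 1/b >= 4" becomes the extremal problem "find the minimum of 1/a + 1/b subject to a + b = 1". The function names `objective` and `grid_minimum` are hypothetical, and the coarse grid search only sanity-checks the answer numerically; it is not a proof and not the authors' evaluation method.

```python
# Illustrative only: this problem and these helper names are assumptions,
# not part of ExtremBench or the paper's pipeline.

def objective(a: float) -> float:
    """Value of 1/a + 1/b with the constraint a + b = 1 substituted in."""
    b = 1.0 - a
    return 1.0 / a + 1.0 / b


def grid_minimum(steps: int = 10_000) -> tuple[float, float]:
    """Scan a over the open interval (0, 1) and return (argmin, min value)."""
    best_a, best_val = 0.0, float("inf")
    for i in range(1, steps):  # skip the endpoints, where the objective blows up
        a = i / steps
        val = objective(a)
        if val < best_val:
            best_a, best_val = a, val
    return best_a, best_val


best_a, best_val = grid_minimum()
print(f"minimum ~ {best_val:.6f} at a ~ {best_a:.4f}")
```

The scan agrees with the closed-form answer from the AM-HM inequality: the minimum is 4, attained at a = b = 1/2. Asking for the extremal value rather than a proof gives a single final answer that can be checked automatically.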
Related papers
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study this gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
- EternalMath: A Living Benchmark of Frontier Mathematics that Evolves with Human Discovery [23.517907682810932]
We propose a fully automated, theorem-grounded pipeline for evaluating frontier mathematical reasoning. The pipeline transforms recent peer-reviewed mathematical literature into executable and verifiable reasoning tasks. Applying this pipeline yields EternalMath, an evolving evaluation suite derived from contemporary research papers.
arXiv Detail & Related papers (2026-01-04T06:40:25Z)
- IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation [4.991157581428135]
IMProofBench is a private benchmark consisting of 39 peer-reviewed problems developed by expert mathematicians. Each problem requires a detailed proof and is paired with subproblems that have final answers. Unlike prior benchmarks, the evaluation setup simulates a realistic research environment.
arXiv Detail & Related papers (2025-09-30T10:50:37Z)
- Can LLMs $\textit{understand}$ Math? -- Exploring the Pitfalls in Mathematical Reasoning [0.0]
Large language models (LLMs) demonstrate considerable potential in various natural language tasks but face significant challenges in mathematical reasoning. We propose an evaluation metric called the MAPLE score, which holistically quantifies reasoning misalignment by integrating error rates, redundancy, and validity.
arXiv Detail & Related papers (2025-05-21T15:12:20Z)
- Computational Reasoning of Large Language Models [51.629694188014064]
We introduce Turing Machine Bench (TMBench), a benchmark to assess the ability of Large Language Models (LLMs) to execute reasoning processes. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a theoretical foundation based on the Turing machine.
arXiv Detail & Related papers (2025-04-29T13:52:47Z)
- Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [4.573289946657861]
We evaluate reasoning models on six problems from the 2025 USAMO. Only Gemini-2.5-Pro achieves a non-trivial score of 25%. Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z)
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
- PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
- Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z)
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain: GPT-4o, the best-performing model, still trails human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
- Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [46.670206614087334]
We argue that if a model really understands a problem, it should be able to apply that understanding robustly across a diverse array of tasks.
MathCheck is a well-designed checklist for testing task generalization and reasoning.
MathCheck better reflects true mathematical abilities and represents mathematical intelligence more linearly.
arXiv Detail & Related papers (2024-07-11T17:58:58Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
- Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.