Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
- URL: http://arxiv.org/abs/2504.01995v2
- Date: Thu, 10 Apr 2025 20:43:23 GMT
- Title: Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
- Authors: Hamed Mahdavi, Alireza Hashemi, Majid Daliri, Pegah Mohammadipour, Alireza Farhadi, Samira Malek, Yekta Yazdanifard, Amir Khasahmadi, Vasant Honavar,
- Abstract summary: Large language models (LLMs) have shown impressive progress in mathematical reasoning tasks.<n>Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks.
- Score: 2.489157527463306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have shown impressive progress in mathematical reasoning tasks. However, current evaluation benchmarks predominantly focus on the accuracy of final answers, often overlooking the crucial logical rigor for mathematical problem solving. The claim that state-of-the-art LLMs can solve Math Olympiad-level problems requires closer examination. To explore this, we conducted both qualitative and quantitative human evaluations of proofs generated by LLMs, and developed a schema for automatically assessing their reasoning capabilities. Our study reveals that current LLMs fall significantly short of solving challenging Olympiad-level problems and frequently fail to distinguish correct mathematical reasoning from clearly flawed solutions. Our analyses demonstrate that the occasional correct final answers provided by LLMs often result from pattern recognition or heuristic shortcuts rather than genuine mathematical reasoning. These findings underscore the substantial gap between LLM performance and human expertise in advanced mathematical reasoning and highlight the importance of developing benchmarks that prioritize the soundness of the reasoning used to arrive at an answer rather than the mere correctness of the final answers.
Related papers
- Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.<n>We rigorously analyze both final answers and solution steps to identify reasoning failures.<n>We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - MathConstruct: Challenging LLM Reasoning with Constructive Proofs [0.9320657506524149]
mc is a new benchmark of 126 challenging problems sourced from various math competitions.<n>mc is suitable for Large Language Models evaluation, as solution correctness can be easily verified.
arXiv Detail & Related papers (2025-02-14T14:44:22Z) - One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs [57.48325300739872]
Leveraging mathematical Large Language Models for proof generation is a fundamental topic in LLMs research.
We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training.
Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples.
arXiv Detail & Related papers (2025-02-12T02:01:10Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps.<n>We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets.<n>We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.