Related papers: UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts

URL: http://arxiv.org/abs/2411.07240v2
Date: Tue, 14 Jan 2025 07:57:26 GMT
Title: UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Authors: Bo Yang, Qingping Yang, Yingwei Ma, Runtao Liu,
Abstract summary: This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess Large Language Models.<n>It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem.<n>The best-performing model, o1-mini, solving only 32.57% of the problems, followed by o1-preview at 27.16%, and GPT-4o at 26.93%.
Score: 7.856746367263317
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and generality. This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess LLMs through extensive unit tests, with a focus on both the accuracy and generality of model responses. It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem. UTMath is highly challenging, with the best-performing model, o1-mini, solving only 32.57\% of the problems, followed by o1-preview at 27.16\%, and GPT-4o at 26.93\%. Furthermore, we present the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to engage in explicit reasoning prior to code generation, thereby facilitating the production of more sophisticated solutions and enhancing overall performance and efficiency. Additionally, we also release the UTMath-Train training dataset (more than 70k samples), to support the community in further exploring mathematical reasoning. Our benchmark can be accessed via the following link: https://github.com/UTMathGroup/UTMath

Related papers

Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models [11.964085209696051]
UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Our evaluation of 23 leading LLMs reveals that the highest EAcc robustness achieved is 56.3% by OpenAI-o1-mini, with large $Delta$ values observed across different models.
arXiv Detail & Related papers (2025-01-23T15:46:43Z)
HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques. Our framework auto-generates a large number of problems with solutions validated against numerical ground truths. We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z)
Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models [63.31878920079154]
We propose a benchmark specifically designed to assess large language models' mathematical reasoning at the Olympiad level. Unlike existing Olympiad-related benchmarks, our dataset focuses exclusively on mathematics and comprises a vast collection of 4428 competition-level problems with rigorous human annotation. Our experimental results show that even the most advanced models, OpenAI o1-mini and OpenAI o1-preview, struggle with highly challenging Olympiad-level problems, with 60.54% and 52.55% accuracy, highlighting significant challenges in Olympiad-level mathematical reasoning.
arXiv Detail & Related papers (2024-10-10T14:39:33Z)
MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs. We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
BEATS: Optimizing LLM Mathematical Capabilities with BackVerify and Adaptive Disambiguate based Efficient Tree Search [22.672130194493793]
Large Language Models (LLMs) have exhibited exceptional performance across a broad range of tasks and domains. They still encounter difficulties in solving mathematical problems due to the rigorous and logical nature of mathematics. We propose a novel approach, BEATS, to enhance mathematical problem-solving abilities.
arXiv Detail & Related papers (2024-09-26T15:47:42Z)
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist [46.670206614087334]
We argue that if a model really understands a problem, it should be robustly applied across a diverse array of tasks. MathCheck is a well-designed checklist for testing task generalization and reasoning. MathCheck better reflects true mathematical abilities and represents mathematical intelligence more linearly.
arXiv Detail & Related papers (2024-07-11T17:58:58Z)
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks [34.09857430966818]
We introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. We conduct fine-tuning experiments with three prominent large language models: LLaMA-2, WizardMath, and MAmmoTH. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems.
arXiv Detail & Related papers (2024-04-19T08:45:42Z)
Can Language Models Solve Olympiad Programming? [40.54366634332231]
This paper introduces the USACO benchmark with 307 problems from the USA Computing Olympiad. We construct and test a range of LM inference methods for competitive programming for the first time. We find GPT-4 only achieves a 8.7% pass@1 accuracy with zero-shot chain-of-thought prompting.
arXiv Detail & Related papers (2024-04-16T23:27:38Z)
Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval achieves state-of-the-art performance on human-labeled datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts. This benchmark is difficult, with the best model only scoring 58.1% in standard settings. We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
VerityMath: Advancing Mathematical Reasoning by Self-Verification Through Unit Consistency [33.760209585322606]
We study the performance of strong open-source LLMs on math word problems using program-based solving techniques. We propose a systematic approach by defining the units for each quantity and ensuring the consistency of these units during mathematical operations. Our approach, which incorporates unit consistency, currently slightly underperforms compared to an approach that does not.
arXiv Detail & Related papers (2023-11-13T09:06:58Z)
Lila: A Unified Benchmark for Mathematical Reasoning [59.97570380432861]
LILA is a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions. We construct our benchmark by extending 20 datasets benchmark by collecting task instructions and solutions in the form of Python programs. We introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA.
arXiv Detail & Related papers (2022-10-31T17:41:26Z)
JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model(PLM) Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement. We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
arXiv Detail & Related papers (2022-06-13T17:03:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.