CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?
- URL: http://arxiv.org/abs/2306.16636v1
- Date: Thu, 29 Jun 2023 02:19:50 GMT
- Title: CMATH: Can Your Language Model Pass Chinese Elementary School Math Test?
- Authors: Tianwen Wei, Jian Luan, Wei Liu, Shuang Dong, Bin Wang
- Abstract summary: We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
- Score: 15.53530547827583
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the Chinese Elementary School Math Word Problems (CMATH) dataset,
comprising 1.7k elementary school-level math word problems with detailed
annotations, sourced from actual Chinese workbooks and exams. This dataset aims
to provide a benchmark tool for assessing the following question: to what grade
level of elementary school math do the abilities of popular large language
models (LLMs) correspond? We evaluate a variety of popular LLMs, including both
commercial and open-source options, and discover that only GPT-4 achieves
success (accuracy $\geq$ 60\%) across all six elementary school grades, while
other models falter at different grade levels. Furthermore, we assess the
robustness of several top-performing LLMs by augmenting the original problems
in the CMATH dataset with distracting information. Our findings reveal that
GPT-4 is able to maintain robustness, while the other models fail. We anticipate
that our study will expose limitations in LLMs' arithmetic and reasoning
capabilities, and promote their ongoing development and advancement.
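As a concrete illustration of the evaluation protocol described in the abstract, the sketch below computes per-grade accuracy and applies the $\geq$ 60% success criterion. It is a minimal sketch under assumptions: the field names ("grade", "question", "answer") and the query_model / extract_answer helpers are illustrative stand-ins, not the authors' released evaluation code.

```python
from collections import defaultdict

SUCCESS_THRESHOLD = 0.60  # a grade counts as "passed" if accuracy >= 60%

def evaluate_by_grade(problems, query_model, extract_answer):
    """Compute per-grade accuracy for a set of math word problems.

    problems: iterable of dicts with keys "grade" (1-6), "question", "answer".
    query_model: callable mapping a question string to the model's raw reply.
    extract_answer: callable normalizing a raw reply to a comparable answer.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for p in problems:
        reply = query_model(p["question"])
        total[p["grade"]] += 1
        if extract_answer(reply) == p["answer"]:
            correct[p["grade"]] += 1
    return {g: correct[g] / total[g] for g in sorted(total)}

def passes_all_grades(accuracy_by_grade):
    """True if the model clears the 60% bar on every grade evaluated."""
    return all(acc >= SUCCESS_THRESHOLD for acc in accuracy_by_grade.values())
```

The robustness probe described above could then be approximated by re-running evaluate_by_grade on distractor-augmented copies of the same problems and comparing the two accuracy profiles.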
Related papers
- HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics [1.5716764919736026]
We introduce HARDMath, a dataset featuring challenging applied mathematics problems that require analytical approximation techniques.
Our framework auto-generates a large number of problems with solutions validated against numerical ground truths.
We evaluate both open- and closed-source LLMs on HARDMath-mini, a sub-sampled test set of 366 problems, as well as on 40 word problems formulated in applied science contexts.
arXiv Detail & Related papers (2024-10-13T20:09:41Z)
- MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs [61.74749961334557]
MathHay is an automated benchmark designed to assess the long-context mathematical reasoning capabilities of LLMs.
We conduct extensive experiments on MathHay to assess the long-context mathematical reasoning abilities of eight top-performing models.
arXiv Detail & Related papers (2024-10-07T02:30:07Z)
- Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models.
Skywork-Math 7B has achieved an impressive accuracy of 51.2% on the competition-level MATH benchmark.
We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
- FineMath: A Fine-Grained Mathematical Evaluation Benchmark for Chinese Large Language Models [44.63505885248145]
FineMath is a fine-grained mathematical evaluation benchmark dataset for assessing Chinese Large Language Models (LLMs).
It is created to cover the key mathematical concepts taught in elementary school math, which are divided into 17 categories of math word problems.
All 17 categories of math word problems are manually annotated with difficulty levels according to the number of reasoning steps required to solve them.
arXiv Detail & Related papers (2024-03-12T15:32:39Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed problem is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in Chinese [21.893992064105085]
SuperCLUE-Math6 is a new benchmark dataset to evaluate the mathematical reasoning abilities of Chinese language models.
SC-Math6 is designed as an upgraded Chinese version of the GSM8K dataset with enhanced difficulty, diversity, and application scope.
arXiv Detail & Related papers (2024-01-22T10:30:11Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
- Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models [23.344490944210456]
We present 515Bench, a more challenging benchmark dataset for evaluating the problem-solving abilities of large language models (LLMs).
We curate challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam.
Our evaluation on various open-source and proprietary models reveals that the highest performance, even after using techniques like self-consistency, self-refinement and chain-of-thought prompting, is less than 40% (a minimal self-consistency sketch follows this entry).
arXiv Detail & Related papers (2023-05-24T11:55:59Z)
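Of the prompting techniques mentioned in the last entry, self-consistency is the most mechanical: sample several chain-of-thought completions and take a majority vote over the extracted final answers. The sketch below assumes a hypothetical sample_answer helper that queries the model once and returns its extracted final answer; it is an illustrative assumption, not code from any paper listed here.

```python
from collections import Counter

def self_consistent_answer(question, sample_answer, n_samples=8):
    """Majority-vote answer over n independently sampled reasoning paths.

    sample_answer: assumed callable that prompts the model once (with
    chain-of-thought) and returns the extracted final answer.
    """
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    answer, _count = votes.most_common(1)[0]
    return answer
```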