SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in
Chinese
- URL: http://arxiv.org/abs/2401.11819v2
- Date: Fri, 2 Feb 2024 02:35:13 GMT
- Title: SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in
Chinese
- Authors: Liang Xu, Hang Xue, Lei Zhu, Kangkang Zhao
- Abstract summary: SuperCLUE-Math6 is a new benchmark dataset to evaluate the mathematical reasoning abilities of Chinese language models.
SC-Math6 is designed as an upgraded Chinese version of the GSM8K dataset with enhanced difficulty, diversity, and application scope.
- Score: 21.893992064105085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SuperCLUE-Math6 (SC-Math6), a new benchmark dataset to evaluate
the mathematical reasoning abilities of Chinese language models. SC-Math6 is
designed as an upgraded Chinese version of the GSM8K dataset with enhanced
difficulty, diversity, and application scope. It consists of over 2000
mathematical word problems requiring multi-step reasoning and providing natural
language solutions. We propose an innovative scheme to quantify the reasoning
capability of large models based on performance over problems with different
reasoning steps. Experiments on 13 representative Chinese models demonstrate a
clear stratification of reasoning levels, with top models like GPT-4 showing
superior performance. SC-Math6 fills the gap in Chinese mathematical reasoning
benchmarks and provides a comprehensive testbed to advance the intelligence of
Chinese language models.
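The abstract describes quantifying reasoning capability from performance on problems with different numbers of reasoning steps. As a rough illustration only, and not the scoring scheme defined in the paper, the sketch below groups results by step count and reports accuracy per group. The function name `accuracy_by_steps`, the `(steps, is_correct)` record layout, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict


def accuracy_by_steps(records):
    """Return accuracy per reasoning-step count.

    `records` is an iterable of (num_reasoning_steps, is_correct) pairs.
    This layout is hypothetical; it is not the SC-Math6 data format.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for steps, ok in records:
        totals[steps] += 1
        correct[steps] += int(ok)
    # Sort by step count so the result reads from easy to hard problems.
    return {s: correct[s] / totals[s] for s in sorted(totals)}


# Toy, made-up results for a single model (not taken from the paper).
records = [
    (1, True), (1, True),
    (2, True), (2, False),
    (3, False), (3, True), (3, False),
]
per_step = accuracy_by_steps(records)
for steps, acc in per_step.items():
    print(f"{steps}-step problems: accuracy {acc:.2f}")
```

A breakdown like this makes it visible whether accuracy falls as the number of required reasoning steps grows, which is the kind of stratification the abstract refers to.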
Related papers
- UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts [8.582930981424528]
This paper introduces the UTMath Benchmark, which robustly evaluates the models through extensive unit tests.
It consists of 1,053 problems across 9 mathematical domains, with over 68 test cases per problem.
We introduce the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to perform explicit reasoning before generating code.
arXiv Detail & Related papers (2024-11-11T18:59:02Z)
- RoMath: A Mathematical Reasoning Benchmark in Romanian [7.7559527224629266]
This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets.
By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models.
We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages.
arXiv Detail & Related papers (2024-09-17T11:03:46Z)
- CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models [41.02149566318779]
We propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math-related questions.
We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation.
arXiv Detail & Related papers (2024-06-28T02:35:51Z)
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
A key and frequently observed finding is that LLMs can behave incorrectly when math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
arXiv Detail & Related papers (2023-06-29T02:19:50Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Language Models are Multilingual Chain-of-Thought Reasoners [83.37148309771378]
We introduce the Multilingual Grade School Math benchmark, by manually translating 250 grade-school math problems into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale.
We show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
arXiv Detail & Related papers (2022-10-06T17:03:34Z)
- JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM).
Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
arXiv Detail & Related papers (2022-06-13T17:03:52Z)