SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in
Chinese
- URL: http://arxiv.org/abs/2401.11819v2
- Date: Fri, 2 Feb 2024 02:35:13 GMT
- Title: SuperCLUE-Math6: Graded Multi-Step Math Reasoning Benchmark for LLMs in
Chinese
- Authors: Liang Xu, Hang Xue, Lei Zhu, Kangkang Zhao
- Abstract summary: SuperCLUE-Math6 is a new benchmark dataset to evaluate the mathematical reasoning abilities of Chinese language models.
SC-Math6 is designed as an upgraded Chinese version of the GSM8K dataset with enhanced difficulty, diversity, and application scope.
- Score: 21.893992064105085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SuperCLUE-Math6(SC-Math6), a new benchmark dataset to evaluate
the mathematical reasoning abilities of Chinese language models. SC-Math6 is
designed as an upgraded Chinese version of the GSM8K dataset with enhanced
difficulty, diversity, and application scope. It consists of over 2000
mathematical word problems requiring multi-step reasoning and providing natural
language solutions. We propose an innovative scheme to quantify the reasoning
capability of large models based on performance over problems with different
reasoning steps. Experiments on 13 representative Chinese models demonstrate a
clear stratification of reasoning levels, with top models like GPT-4 showing
superior performance. SC-Math6 fills the gap in Chinese mathematical reasoning
benchmarks and provides a comprehensive testbed to advance the intelligence of
Chinese language models.
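The abstract does not spell out the scoring scheme, so the following is only a minimal sketch of the idea of grading reasoning ability by performance across problems with different step counts. The step-weighted aggregation, the `reasoning_score` function, and the `steps`/`correct` record fields are illustrative assumptions, not the SC-Math6 formula.

```python
from collections import defaultdict

def reasoning_score(results):
    """Aggregate per-problem results into a step-weighted reasoning score.

    `results` is a list of dicts with hypothetical fields:
      - "steps":   number of reasoning steps the problem requires
      - "correct": whether the model answered it correctly
    The step-count weighting is an illustrative assumption, not the
    scheme defined in the SC-Math6 paper.
    """
    by_steps = defaultdict(list)
    for r in results:
        by_steps[r["steps"]].append(r["correct"])

    # Accuracy per reasoning-step bucket.
    acc = {k: sum(v) / len(v) for k, v in by_steps.items()}

    # Weight harder (multi-step) buckets more heavily.
    total_weight = sum(acc.keys())
    score = sum(k * a for k, a in acc.items()) / total_weight
    return acc, score

# Toy example: problems needing 1-3 reasoning steps.
toy = [
    {"steps": 1, "correct": True},
    {"steps": 2, "correct": True},
    {"steps": 2, "correct": False},
    {"steps": 3, "correct": False},
]
per_step_accuracy, overall = reasoning_score(toy)
print(per_step_accuracy, round(overall, 3))  # {1: 1.0, 2: 0.5, 3: 0.0} 0.333
```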
Related papers
- RoMath: A Mathematical Reasoning Benchmark in Romanian [7.7559527224629266]
This paper introduces RoMath, a Romanian mathematical reasoning benchmark suite comprising three datasets.
By focusing on Romanian, a low-resource language with unique linguistic features, RoMath addresses the limitations of Anglo-centric models.
We benchmark several open-weight language models, highlighting the importance of creating resources for underrepresented languages.
arXiv Detail & Related papers (2024-09-17T11:03:46Z)
- CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models [41.02149566318779]
We propose a Chinese Multi-modal Math Skill Evaluation Benchmark, named CMMaTH, containing 23k multimodal K12 math-related questions.
We have constructed an open-source tool GradeGPT integrated with the CMMaTH dataset, facilitating stable, rapid, and cost-free model evaluation.
arXiv Detail & Related papers (2024-06-28T02:35:51Z)
- MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark [82.64129627675123]
MathBench is a new benchmark that rigorously assesses the mathematical capabilities of large language models.
MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills.
arXiv Detail & Related papers (2024-05-20T17:52:29Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that LLMs can behave incorrectly when the math questions are slightly changed.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
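The robustness idea described above (accuracy should not collapse when a question is slightly changed) can be illustrated with a minimal sketch. The data layout and the simple accuracy-drop metric below are assumptions for illustration, not the GSM-Plus evaluation protocol.

```python
def robustness_report(model_answers, gold):
    """Compare accuracy on original questions vs. their perturbed variants.

    `model_answers` and `gold` are hypothetical dicts mapping a question id
    to {"original": answer, "variants": [answers...]}; the structure and the
    accuracy-drop metric are illustrative, not the GSM-Plus protocol.
    """
    orig_correct, var_correct, var_total = 0, 0, 0
    for qid, ans in model_answers.items():
        orig_correct += ans["original"] == gold[qid]["original"]
        for pred, ref in zip(ans["variants"], gold[qid]["variants"]):
            var_correct += pred == ref
            var_total += 1

    orig_acc = orig_correct / len(model_answers)
    var_acc = var_correct / var_total
    return {"original": orig_acc, "variants": var_acc, "drop": orig_acc - var_acc}

# Toy usage with one question and two perturbed variants.
gold = {"q1": {"original": 18, "variants": [21, 18]}}
preds = {"q1": {"original": 18, "variants": [21, 17]}}
print(robustness_report(preds, gold))
# {'original': 1.0, 'variants': 0.5, 'drop': 0.5}
```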
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers the exploration and training of powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- KwaiYiiMath: Technical Report [32.26926431983553]
We introduce KwaiYiiMath, which enhances the mathematical reasoning abilities of KwaiYiiBase1.
We also constructed a small-scale Chinese primary school mathematics test set (named KMath) to evaluate the correctness of the problem-solving process generated by the models.
arXiv Detail & Related papers (2023-10-11T13:35:05Z)
- CMATH: Can Your Language Model Pass Chinese Elementary School Math Test? [15.53530547827583]
We present the Chinese Elementary School Math Word Problems dataset, comprising 1.7k elementary school-level math word problems with detailed annotations.
This dataset aims to provide a benchmark tool for assessing the abilities of popular large language models (LLMs).
We evaluate a variety of popular LLMs, including both commercial and open-source options, and discover that only GPT-4 achieves success (accuracy $\geq$ 60%) across all six elementary school grades.
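The success criterion quoted above (accuracy of at least 60% in every one of the six grades) is easy to state in code. The sketch below assumes a hypothetical list of (grade, correct) records; it is not the CMATH release format or its official scorer.

```python
def passes_all_grades(records, threshold=0.60, grades=range(1, 7)):
    """Return True if per-grade accuracy meets the threshold for grades 1-6.

    `records` is a hypothetical list of (grade, correct) pairs; the 60%
    threshold mirrors the success criterion quoted in the CMATH summary.
    """
    for g in grades:
        graded = [correct for grade, correct in records if grade == g]
        if not graded or sum(graded) / len(graded) < threshold:
            return False
    return True

# Toy usage: grade 3 accuracy drops to 1/3, so the model fails overall.
records = [(g, True) for g in range(1, 7)] + [(3, False), (3, False)]
print(passes_all_grades(records))  # False
```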
arXiv Detail & Related papers (2023-06-29T02:19:50Z)
- PAL: Program-aided Language Models [112.94785609781503]
We present Program-aided Language models (PAL), which use an LLM to read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
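PAL's central idea is that the model writes a program as its intermediate reasoning and a runtime computes the final answer. The sketch below illustrates this with a toy word problem; the problem text, the hand-written `generated_program`, and the bare `exec` call are illustrative assumptions rather than the paper's actual prompts or runtime setup.

```python
# A toy word problem and a program of the kind a PAL-style model might
# generate as its "reasoning"; the interpreter, not the model, does the math.
problem = (
    "Xiao Ming has 3 boxes with 12 pencils each. "
    "He gives away 7 pencils. How many pencils are left?"
)

generated_program = """
boxes = 3
pencils_per_box = 12
given_away = 7
answer = boxes * pencils_per_box - given_away
"""

# Offload the solution step to the Python runtime, as PAL does.
namespace = {}
exec(generated_program, namespace)  # illustrative only; sandbox real model output
print(namespace["answer"])  # 29
```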
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
- Language Models are Multilingual Chain-of-Thought Reasoners [83.37148309771378]
We introduce the Multilingual Grade School Math (MGSM) benchmark by manually translating 250 grade-school math problems into ten typologically diverse languages.
We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing model scale.
We show that the multilingual reasoning abilities of language models extend to other tasks such as commonsense reasoning and word-in-context semantic judgment.
arXiv Detail & Related papers (2022-10-06T17:03:34Z)
- JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM).
Unlike the texts in standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols, and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
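A curriculum that orders pre-training from basic to advanced material can be sketched generically as below. The two-stage batching is an illustrative assumption; the concrete tasks that make up JiuZhang's basic and advanced courses are defined in the paper itself.

```python
def curriculum_batches(basic_corpus, advanced_corpus, batch_size=32):
    """Yield pre-training batches in curriculum order: basic first, then advanced.

    A generic two-stage schedule for illustration; the actual tasks in
    JiuZhang's basic and advanced courses are described in the paper.
    """
    for corpus in (basic_corpus, advanced_corpus):
        for i in range(0, len(corpus), batch_size):
            yield corpus[i:i + batch_size]

# Toy usage: two basic batches followed by two advanced batches.
batches = list(curriculum_batches(["basic text"] * 64, ["advanced text"] * 64))
print(len(batches))  # 4
```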
arXiv Detail & Related papers (2022-06-13T17:03:52Z)