Related papers: A Careful Examination of Large Language Model Performance on Grade School Arithmetic

URL: http://arxiv.org/abs/2405.00332v3
Date: Fri, 3 May 2024 17:53:26 GMT
Title: A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Authors: Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue
Abstract summary: Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. There is growing concern that some of this performance actually reflects dataset contamination.
Score: 4.667380916143971
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning. However, there is growing concern that some of this performance actually reflects dataset contamination, where data closely resembling benchmark questions leaks into the training data, instead of true reasoning ability. To investigate this claim rigorously, we commission Grade School Math 1000 (GSM1k). GSM1k is designed to mirror the style and complexity of the established GSM8k benchmark, the gold standard for measuring elementary mathematical reasoning. We ensure that the two benchmarks are comparable across important metrics such as human solve rates, number of steps in solution, answer magnitude, and more. When evaluating leading open- and closed-source LLMs on GSM1k, we observe accuracy drops of up to 13%, with several families of models (e.g., Phi and Mistral) showing evidence of systematic overfitting across almost all model sizes. At the same time, many models, especially those on the frontier (e.g., Gemini/GPT/Claude), show minimal signs of overfitting. Further analysis suggests a positive relationship (Spearman's r^2=0.32) between a model's probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that many models may have partially memorized GSM8k.
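The correlation analysis mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes one value per model for the likelihood of generating GSM8k examples and one value for the GSM8k-minus-GSM1k accuracy gap. All numbers and names below (`log_likelihood`, `accuracy_gap`, `spearman`) are hypothetical and are not taken from the paper; the sketch only shows how a Spearman rank correlation and its square could be computed, not the authors' actual procedure.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model values, for illustration only (NOT from the paper).
log_likelihood = [-1.9, -1.4, -1.7, -1.1, -1.5]  # log-likelihood of GSM8k text
accuracy_gap = [0.5, 6.0, 1.0, 4.0, 9.0]         # GSM8k minus GSM1k, in points

rho = spearman(log_likelihood, accuracy_gap)
print(f"Spearman rho = {rho:.3f}, rho^2 = {rho ** 2:.3f}")
```

A rank-based measure is a natural choice here because it only assumes a monotonic relationship between likelihood and accuracy gap, not a linear one.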

Related papers

Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z)
What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy. By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models [24.266973481633755]
The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. We introduce GSM-Symbolic, an improved benchmark created from symbolic templates. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question.
arXiv Detail & Related papers (2024-10-07T17:36:37Z)
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems [5.755472812258969]
We present Scheherazade, an automated approach to produce large quantities of challenging mathematical reasoning benchmarks. We propose two different chaining methods, forward chaining and backward chaining, which include randomized branching techniques. We show that while other frontier models' performance declines precipitously after only a few questions are chained, our evaluation suggests o1-preview's performance persists.
arXiv Detail & Related papers (2024-09-30T18:48:34Z)
Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities. The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
Orca-Math: Unlocking the potential of SLMs in Grade School Math [10.206509967833664]
A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Our approach has the following key elements: a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data.
arXiv Detail & Related papers (2024-02-16T23:44:38Z)
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs). This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation. By applying this paradigm to the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
TinyGSM: achieving >80% on GSM8k with small language models [49.21136294791747]
Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question. Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B. Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
arXiv Detail & Related papers (2023-12-14T18:58:28Z)
MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning [54.2093509928664]
In math reasoning with large language models, fine-tuning data augmentation by query evolution and diverse reasoning paths has been empirically verified to be effective. We investigate such data augmentation in math reasoning and aim to answer the open questions about it. We release our code and augmented data at https://github.com/OFA-Sys/8k-Scel.
arXiv Detail & Related papers (2023-10-09T08:18:58Z)
Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models. We evaluate DIVERSE on the latest language model code-davinci and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks.
arXiv Detail & Related papers (2022-06-06T03:38:36Z)

This list is automatically generated from the titles and abstracts of the papers on this site.