TinyGSM: achieving >80% on GSM8k with small language models
- URL: http://arxiv.org/abs/2312.09241v1
- Date: Thu, 14 Dec 2023 18:58:28 GMT
- Title: TinyGSM: achieving >80% on GSM8k with small language models
- Authors: Bingbin Liu, Sebastien Bubeck, Ronen Eldan, Janardhan Kulkarni,
Yuanzhi Li, Anh Nguyen, Rachel Ward, Yi Zhang
- Abstract summary: Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question.
Specifically for solving grade school math, the smallest model size required so far to break the 80% barrier on the GSM8K benchmark is 34B.
Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
- Score: 49.21136294791747
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Small-scale models offer various computational advantages, yet the
extent to which size is critical for problem-solving abilities remains an open
question. Specifically for solving grade school math, the smallest model size
required so far to break the 80% barrier on the GSM8K benchmark is 34B. Our
work studies how high-quality datasets may be the key for small language models
to acquire mathematical reasoning. We introduce TinyGSM, a synthetic dataset of
12.3M grade school math problems paired with Python solutions, generated fully
by GPT-3.5. After finetuning on TinyGSM, we find that a duo of a 1.3B
generation model and a 1.3B verifier model can achieve 81.5% accuracy,
outperforming existing models that are orders of magnitude larger. This also
rivals the performance of the GPT-3.5 "teacher" model (77.4%), from which our
model's training data is generated. Our approach is simple and has two key
components: 1) the high-quality dataset TinyGSM, and 2) the use of a verifier,
which selects the final output from multiple candidate generations.
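To make component 2) concrete, the sketch below gives a minimal, hypothetical illustration (not the authors' code): a toy problem paired with an executable Python solution in the spirit of the TinyGSM format, and a best-of-N loop in which a verifier score picks the final answer among several candidate programs. The generate and score hooks, the toy problem, and the default of 16 candidates are assumptions standing in for the finetuned 1.3B generation and verifier models.

# Minimal sketch (an illustration, not the authors' released code) of the two
# components above: a grade-school problem paired with a Python solution, and
# verifier-guided selection of one final answer from several candidates.
from typing import Callable, Tuple

# 1) A toy problem with an executable Python solution, in the spirit of the
#    TinyGSM format (invented here for illustration, not taken from the dataset).
PROBLEM = "A baker fills 5 trays with 12 muffins each. How many muffins are there?"

def candidate_solution() -> int:
    muffins_per_tray = 12
    trays = 5
    return muffins_per_tray * trays

# 2) Best-of-N selection: sample several candidate programs, execute each, and
#    keep the answer the verifier scores highest. The generate and score
#    arguments are hypothetical hooks for the 1.3B generation and verifier models.
def select_answer(
    problem: str,
    generate: Callable[[str], Callable[[], int]],
    score: Callable[[str, int], float],
    num_candidates: int = 16,
) -> Tuple[int, float]:
    best_answer, best_score = 0, float("-inf")
    for _ in range(num_candidates):
        program = generate(problem)      # sample one candidate Python solution
        try:
            answer = program()           # run it to obtain a numeric answer
        except Exception:
            continue                     # discard candidates that fail to execute
        s = score(problem, answer)       # verifier rates the (problem, answer) pair
        if s > best_score:
            best_answer, best_score = answer, s
    return best_answer, best_score

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real use would call the models.
    answer, s = select_answer(
        PROBLEM,
        generate=lambda p: candidate_solution,
        score=lambda p, a: 1.0 if a == 60 else 0.0,
        num_candidates=4,
    )
    print(answer, s)  # -> 60 1.0

In the paper the verifier is itself a finetuned 1.3B model; the lambda stand-ins above exist only so the snippet executes on its own.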
Related papers
- SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights [89.56181323849512]
We propose SuperCorrect, a framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model.
In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts.
In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
arXiv Detail & Related papers (2024-10-11T17:25:52Z) - A Careful Examination of Large Language Model Performance on Grade School Arithmetic [4.667380916143971]
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning.
There is growing concern that some of this performance actually reflects dataset contamination.
arXiv Detail & Related papers (2024-05-01T05:52:05Z) - Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This ability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z) - Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z) - MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs [38.127313175508746]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset.
Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique.
MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z) - Orca-Math: Unlocking the potential of SLMs in Grade School Math [10.206509967833664]
A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters.
To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or to use tools that help avoid calculation errors.
Our approach has the following key elements: a high-quality synthetic dataset of 200K math problems, created using a multi-agent setup in which agents collaborate on the data.
arXiv Detail & Related papers (2024-02-16T23:44:38Z) - Reusing Pretrained Models by Multi-linear Operators for Efficient Training [65.64075958382034]
Training large models from scratch usually costs a substantial amount of resources.
Recent studies such as bert2BERT and LiGO have reused small pretrained models to initialize a large model.
We propose a method that linearly correlates each weight of the target model to all the weights of the pretrained model.
arXiv Detail & Related papers (2023-10-16T06:16:47Z) - MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z) - GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.