Orca-Math: Unlocking the potential of SLMs in Grade School Math
- URL: http://arxiv.org/abs/2402.14830v1
- Date: Fri, 16 Feb 2024 23:44:38 GMT
- Title: Orca-Math: Unlocking the potential of SLMs in Grade School Math
- Authors: Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah
- Abstract summary: A recent study hypothesized that the smallest model size, needed to achieve over 80% accuracy on the GSM8K benchmark, is 34 billion parameters.
To reach this level of performance with smaller models, researcher often train SLMs to generate Python code or use tools to help avoid calculation errors.
Our approach has the following key elements: A high quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data.
- Score: 10.206509967833664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mathematical word problem-solving has long been recognized as a complex task
for small language models (SLMs). A recent study hypothesized that the smallest
model size, needed to achieve over 80% accuracy on the GSM8K benchmark, is 34
billion parameters. To reach this level of performance with smaller models,
researcher often train SLMs to generate Python code or use tools to help avoid
calculation errors. Additionally, they employ ensembling, where outputs of up
to 100 model runs are combined to arrive at a more accurate result. Result
selection is done using consensus, majority vote or a separate a verifier model
used in conjunction with the SLM. Ensembling provides a substantial boost in
accuracy but at a significant cost increase with multiple calls to the model
(e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5).
In this work, we present Orca-Math, a 7-billion-parameter SLM based on the
Mistral-7B, which achieves 86.81% on GSM8k without the need for multiple model
calls or the use of verifiers, code execution or any other external tools. Our
approach has the following key elements: (1) A high quality synthetic dataset
of 200K math problems created using a multi-agent setup where agents
collaborate to create the data, (2) An iterative learning techniques that
enables the SLM to practice solving problems, receive feedback on its solutions
and learn from preference pairs incorporating the SLM solutions and the
feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves
81.50% on GSM8k pass@1 metric. With iterative preference learning, Orca-Math
achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly
larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, ChatGPT-3.5. It
also significantly outperforms other smaller models while using much smaller
data (hundreds of thousands vs. millions of problems).
Related papers
- DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning [24.68321102981711]
We introduce a series of large language models (LLMs) that employ the Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed as DotaMath.
DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, and engaging in self-reflection and correction.
We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs.
arXiv Detail & Related papers (2024-07-04T17:39:16Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - A Careful Examination of Large Language Model Performance on Grade School Arithmetic [4.667380916143971]
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning.
There is growing concern that some of this performance actually reflects dataset contamination.
arXiv Detail & Related papers (2024-05-01T05:52:05Z) - Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z) - MathGenie: Generating Synthetic Data with Question Back-translation for
Enhancing Mathematical Reasoning of LLMs [39.769464414087935]
MathGenie is a novel method for generating diverse and reliable math problems from a small-scale problem-solution dataset.
Various pretrained models, ranging from 7B to 70B, are trained on the newly curated data to test the effectiveness of the proposed augmentation technique.
MathGenieLM-InternLM2 achieves an accuracy of 87.7% on GSM8K and 55.7% on MATH, securing the best overall score among open-source language models.
arXiv Detail & Related papers (2024-02-26T07:17:25Z) - TinyGSM: achieving >80% on GSM8k with small language models [49.21136294791747]
Small-scale models offer various computational advantages, and yet to which extent size is critical for problem-solving abilities remains an open question.
Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains to be 34B.
Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
arXiv Detail & Related papers (2023-12-14T18:58:28Z) - Look Before You Leap: A Universal Emergent Decomposition of Retrieval
Tasks in Language Models [58.57279229066477]
We study how language models (LMs) solve retrieval tasks in diverse situations.
We introduce ORION, a collection of structured retrieval tasks spanning six domains.
We find that LMs internally decompose retrieval tasks in a modular way.
arXiv Detail & Related papers (2023-12-13T18:36:43Z) - MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [91.66694225955872]
We propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning.
Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge.
We release all the MetaMathQA dataset, the MetaMath models with different model sizes and the training code for public use.
arXiv Detail & Related papers (2023-09-21T17:45:42Z) - Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zerothorder (MeZO) to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.