Orca-Math: Unlocking the potential of SLMs in Grade School Math
- URL: http://arxiv.org/abs/2402.14830v1
- Date: Fri, 16 Feb 2024 23:44:38 GMT
- Title: Orca-Math: Unlocking the potential of SLMs in Grade School Math
- Authors: Arindam Mitra, Hamed Khanpour, Corby Rosset, Ahmed Awadallah
- Abstract summary: A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters.
To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors.
Our approach has the following key elements: a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data.
- Score: 10.206509967833664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mathematical word problem-solving has long been recognized as a complex task
for small language models (SLMs). A recent study hypothesized that the smallest
model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34
billion parameters. To reach this level of performance with smaller models,
researchers often train SLMs to generate Python code or use tools to help avoid
calculation errors. Additionally, they employ ensembling, where outputs of up
to 100 model runs are combined to arrive at a more accurate result. Result
selection is done using consensus, majority voting, or a separate verifier model
used in conjunction with the SLM. Ensembling provides a substantial boost in
accuracy but at a significant cost increase with multiple calls to the model
(e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5).
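A minimal sketch of this kind of majority-vote ensembling, assuming a hypothetical `sample_solution(problem)` callable that stands in for a single model call and a simple last-number answer extractor (an illustration of the general technique, not the Phi-GSM pipeline):

```python
import re
from collections import Counter

def extract_answer(solution_text: str) -> str:
    """Pull the final numeric answer out of a generated solution.
    Here we simply take the last number in the text; real pipelines
    use stricter answer-extraction rules."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution_text.replace(",", ""))
    return numbers[-1] if numbers else ""

def majority_vote(problem: str, sample_solution, k: int = 48) -> str:
    """Sample k solutions and return the most common final answer.
    Each call to sample_solution is one (costly) model run, which is
    why ensembling multiplies inference cost."""
    answers = [extract_answer(sample_solution(problem)) for _ in range(k)]
    answers = [a for a in answers if a]  # drop samples with no parsable answer
    return Counter(answers).most_common(1)[0][0] if answers else ""
```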
In this work, we present Orca-Math, a 7-billion-parameter SLM based on
Mistral-7B, which achieves 86.81% on GSM8k without the need for multiple model
calls or the use of verifiers, code execution or any other external tools. Our
approach has the following key elements: (1) A high quality synthetic dataset
of 200K math problems created using a multi-agent setup where agents
collaborate to create the data, and (2) an iterative learning technique that
enables the SLM to practice solving problems, receive feedback on its solutions
and learn from preference pairs incorporating the SLM solutions and the
feedback. When trained with Supervised Fine-Tuning alone, Orca-Math achieves
81.50% on the GSM8k pass@1 metric. With iterative preference learning, Orca-Math
achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly
larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, and ChatGPT-3.5. It
also significantly outperforms other smaller models while using much less
data (hundreds of thousands vs. millions of problems).
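A rough sketch of how the preference pairs in element (2) could be assembled from model samples plus answer-level feedback; `generate_solutions` and `extract_answer` are hypothetical stand-ins, and the paper's actual feedback and training procedure may differ. The resulting pairs would then feed a DPO-style preference-learning step.

```python
import random
from typing import Callable, Dict, List

def build_preference_pairs(
    problems: List[Dict],                     # each item: {"question": ..., "answer": ...}
    generate_solutions: Callable[[str, int], List[str]],  # hypothetical sampler: k solutions per question
    extract_answer: Callable[[str], str],     # pulls the final answer out of a solution
    k: int = 4,
) -> List[Dict]:
    """Turn model samples plus correctness feedback into (chosen, rejected)
    pairs for preference learning. Sketch only, not the paper's exact recipe."""
    pairs = []
    for item in problems:
        samples = generate_solutions(item["question"], k)
        correct = [s for s in samples if extract_answer(s) == str(item["answer"])]
        wrong = [s for s in samples if extract_answer(s) != str(item["answer"])]
        if correct and wrong:  # need both a preferred and a dispreferred solution
            pairs.append({
                "prompt": item["question"],
                "chosen": random.choice(correct),
                "rejected": random.choice(wrong),
            })
    return pairs
```

Each iteration would regenerate pairs with the latest checkpoint and fine-tune on them, mirroring the practice-feedback loop the abstract describes.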
Related papers
- Not All Votes Count! Programs as Verifiers Improve Self-Consistency of Language Models for Math Reasoning [24.386388107656334]
We propose PROVE, a framework that uses program-based verification to filter out potentially incorrect reasoning paths.
Instead of relying on vanilla majority voting, our approach rejects solutions whose corresponding program outputs are inconsistent with the generated solution.
PROVE consistently outperforms vanilla majority voting on mathematical reasoning tasks across all datasets and model sizes.
arXiv Detail & Related papers (2024-10-16T14:24:55Z)
- SuperCorrect: Supervising and Correcting Language Models with Error-Driven Insights [89.56181323849512]
We propose SuperCorrect, a framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model.
In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts.
In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
arXiv Detail & Related papers (2024-10-11T17:25:52Z)
- LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ [0.0]
Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs).
We developed LLaMa-SciQ to assist college students in solving and understanding MCQs in STEM fields.
For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset.
arXiv Detail & Related papers (2024-09-25T09:41:46Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based search method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z)
- A Careful Examination of Large Language Model Performance on Grade School Arithmetic [4.667380916143971]
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning.
There is growing concern that some of this performance actually reflects dataset contamination.
arXiv Detail & Related papers (2024-05-01T05:52:05Z)
- Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
- TinyGSM: achieving >80% on GSM8k with small language models [49.21136294791747]
Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question.
Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.
Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
arXiv Detail & Related papers (2023-12-14T18:58:28Z)
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [91.66694225955872]
We propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning.
Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge.
We release the MetaMathQA dataset, the MetaMath models at different model sizes, and the training code for public use.
arXiv Detail & Related papers (2023-09-21T17:45:42Z)