Common 7B Language Models Already Possess Strong Math Capabilities
- URL: http://arxiv.org/abs/2403.04706v1
- Date: Thu, 7 Mar 2024 18:00:40 GMT
- Title: Common 7B Language Models Already Possess Strong Math Capabilities
- Authors: Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu,
Zheng Zhang, Houwen Peng
- Abstract summary: This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
- Score: 61.61442513067561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mathematical capabilities were previously believed to emerge in common
language models only at a very large scale or require extensive math-related
pre-training. This paper shows that the LLaMA-2 7B model with common
pre-training already exhibits strong mathematical abilities, as evidenced by
its impressive accuracy of 97.7% and 72.0% on the GSM8K and MATH benchmarks,
respectively, when selecting the best response from 256 random generations. The
primary issue with the current base model is the difficulty in consistently
eliciting its inherent mathematical capabilities. Notably, the accuracy for the
first answer drops to 49.5% and 7.9% on the GSM8K and MATH benchmarks,
respectively. We find that simply scaling up the SFT data can significantly
enhance the reliability of generating correct answers. However, the potential
for extensive scaling is constrained by the scarcity of publicly available math
questions. To overcome this limitation, we employ synthetic data, which proves
to be nearly as effective as real data and shows no clear saturation when
scaled up to approximately one million samples. This straightforward approach
achieves an accuracy of 82.6% on GSM8K and 40.6% on MATH using LLaMA-2 7B
models, surpassing previous models by 14.2% and 20.8%, respectively. We also
provide insights into scaling behaviors across different reasoning complexities
and error types.
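As a concrete illustration of what "selecting the best response from 256 random generations" means operationally, the following is a minimal sketch of a best-of-N evaluation loop contrasted with single greedy ("first answer") accuracy. It assumes a Hugging Face transformers-style `generate` API and GSM8K-style numeric gold answers; the checkpoint id, prompt template, sampling hyperparameters, and answer-extraction rule are illustrative assumptions, not the paper's actual setup.

```python
# Sketch (not the paper's code): "first answer" accuracy from one greedy
# completion vs. oracle "best of N" accuracy, where a question counts as
# solved if any of N sampled completions reaches the gold answer.
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)


def extract_answer(completion: str) -> str | None:
    """Take the last number in a completion as its final answer (GSM8K-style)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return numbers[-1] if numbers else None


def generate(question: str, n_samples: int, sample: bool) -> list[str]:
    """Return n_samples sampled completions, or a single greedy completion."""
    prompt = f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    kwargs = dict(max_new_tokens=512, pad_token_id=tokenizer.eos_token_id)
    if sample:  # in practice, sample in smaller batches to limit memory use
        kwargs.update(do_sample=True, temperature=0.7, top_p=0.95,
                      num_return_sequences=n_samples)
    outputs = model.generate(**inputs, **kwargs)
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)


def evaluate(dataset: list[tuple[str, str]], n_samples: int = 256) -> tuple[float, float]:
    """Return (first-answer accuracy, best-of-N accuracy) over (question, gold) pairs."""
    first_correct = best_correct = 0
    for question, gold in dataset:
        greedy_answer = extract_answer(generate(question, 1, sample=False)[0])
        sampled_answers = {extract_answer(c) for c in generate(question, n_samples, sample=True)}
        first_correct += greedy_answer == gold
        best_correct += gold in sampled_answers  # oracle: any correct sample counts
    return first_correct / len(dataset), best_correct / len(dataset)
```

Under this kind of oracle check, the reported gap (97.7% best-of-256 vs. 49.5% first-answer on GSM8K) is simply the difference between the best-of-N and first-answer rates; the paper's SFT and synthetic-data scaling aim to close that gap without relying on an oracle.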
Related papers
- LLaMa-SciQ: An Educational Chatbot for Answering Science MCQ [0.0]
Large Language Models (LLMs) often struggle with tasks requiring mathematical reasoning, particularly multiple-choice questions (MCQs).
We developed LLaMa-SciQ to assist college students in solving and understanding MCQs in STEM fields.
For mathematical reasoning, LLaMa-SciQ achieved 74.5% accuracy on the GSM8k dataset and 30% on the MATH dataset.
arXiv Detail & Related papers (2024-09-25T09:41:46Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms are originally designed for the single-turn chat task.
We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) on common 7B language models.
Skywork-Math 7B achieves an impressive accuracy of 51.2% on the competition-level MATH benchmark.
We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
- A Careful Examination of Large Language Model Performance on Grade School Arithmetic [4.573055530800853]
Large language models (LLMs) have achieved impressive success on many benchmarks for mathematical reasoning.
There is growing concern that some of this performance actually reflects dataset contamination.
arXiv Detail & Related papers (2024-05-01T05:52:05Z)
- Key-Point-Driven Data Synthesis with its Enhancement on Mathematical Reasoning [110.80663974060624]
Key-Point-Driven Data Synthesis (KPDDS) is a novel data synthesis framework that synthesizes question-answer pairs.
KPDDS ensures the generation of novel questions with rigorous quality control and substantial scalability.
We present KPMath, an extensive synthetic dataset tailored for mathematical reasoning, comprising over 800K question-answer pairs.
arXiv Detail & Related papers (2024-03-04T18:58:30Z)
- Orca-Math: Unlocking the potential of SLMs in Grade School Math [10.206509967833664]
A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters.
To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors.
Our approach has the following key elements: a high-quality synthetic dataset of 200K math problems created using a multi-agent setup in which agents collaborate to create the data.
arXiv Detail & Related papers (2024-02-16T23:44:38Z)
- TinyGSM: achieving >80% on GSM8k with small language models [49.21136294791747]
Small-scale models offer various computational advantages, yet the extent to which size is critical for problem-solving abilities remains an open question.
Specifically for solving grade school math, the smallest model size so far required to break the 80% barrier on the GSM8K benchmark remains 34B.
Our work studies how high-quality datasets may be the key for small language models to acquire mathematical reasoning.
arXiv Detail & Related papers (2023-12-14T18:58:28Z)
- MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z)
- A contextual analysis of multi-layer perceptron models in classifying hand-written digits and letters: limited resources [0.0]
We extensively test an end-to-end vanilla neural network (MLP) approach in pure numpy without any pre-processing or feature extraction done beforehand.
We show that basic data mining operations can significantly improve the performance of the models in terms of computational time.
arXiv Detail & Related papers (2021-07-05T04:30:37Z)