Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
- URL: http://arxiv.org/abs/2409.12122v1
- Date: Wed, 18 Sep 2024 16:45:37 GMT
- Title: Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement
- Authors: An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, Zhenru Zhang,
- Abstract summary: We present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B.
Qwen2.5-Math-Instruct supports both Chinese and English and possesses advanced mathematical reasoning capabilities.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present a series of math-specific large language models: Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. The core innovation of the Qwen2.5 series lies in integrating the philosophy of self-improvement throughout the entire pipeline, from pre-training and post-training to inference: (1) During the pre-training phase, Qwen2-Math-Instruct is utilized to generate large-scale, high-quality mathematical data. (2) In the post-training phase, we develop a reward model (RM) by conducting massive sampling from Qwen2-Math-Instruct. This RM is then applied to the iterative evolution of data in supervised fine-tuning (SFT). With a stronger SFT model, it is possible to iteratively train and update the RM, which in turn guides the next round of SFT data iteration. On the final SFT model, we employ the ultimate RM for reinforcement learning, resulting in Qwen2.5-Math-Instruct. (3) Furthermore, during the inference stage, the RM is used to guide sampling, optimizing the model's performance. Qwen2.5-Math-Instruct supports both Chinese and English and possesses advanced mathematical reasoning capabilities, including Chain-of-Thought (CoT) and Tool-Integrated Reasoning (TIR). We evaluate our models on 10 mathematics datasets in both English and Chinese, such as GSM8K, MATH, GaoKao, AMC23, and AIME24, covering a range of difficulties from grade-school level to math competition problems.
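To make the RM's role in steps (2) and (3) concrete, below is a minimal sketch of RM-guided best-of-N selection: sample several candidate solutions, score them with a reward model, return the highest-scoring one (inference-time guidance), and optionally keep only high-reward samples for the next SFT round (data evolution). This is an illustration only, not the authors' implementation; `generate_candidates` and `reward_model_score` are hypothetical stand-ins for the policy model's sampler and the trained RM.

```python
from typing import Callable, List, Tuple


def best_of_n(
    problem: str,
    generate_candidates: Callable[[str, int], List[str]],   # hypothetical: samples n solutions from the policy model
    reward_model_score: Callable[[str, str], float],         # hypothetical: RM score for a (problem, solution) pair
    n: int = 8,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Sample n candidate solutions and return the one the RM scores highest,
    plus all scored candidates for later reuse."""
    candidates = generate_candidates(problem, n)
    scored = [(sol, reward_model_score(problem, sol)) for sol in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[0][0], scored


def filter_for_sft(scored: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep only high-reward samples, mirroring RM-driven evolution of SFT data."""
    return [sol for sol, score in scored if score >= threshold]
```

In the pipeline described above, the same pattern serves both purposes: reranking sampled responses at inference and selecting high-quality responses to seed the next SFT/RM iteration; the actual Qwen2.5-Math training and sampling code is not shown in the abstract.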
Related papers
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task [49.355810887265925]
We introduce MathFimer, a novel framework for mathematical reasoning step expansion.
We develop a specialized model, MathFimer-7B, on our carefully curated NuminaMath-FIM dataset.
We then apply these models to enhance existing mathematical reasoning datasets by inserting detailed intermediate steps into their solution chains.
arXiv Detail & Related papers (2025-02-17T11:22:24Z)
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate [41.58282051139543]
Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions.
Inspired by human learning processes that emphasize critical thinking, we propose Critique Fine-Tuning (CFT).
CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT.
arXiv Detail & Related papers (2025-01-29T15:20:30Z)
- URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics [25.308196207219613]
Chain-of-Thought (CoT) reasoning is widely used to enhance the mathematical reasoning capabilities of large language models (LLMs).
In this work, we propose a novel framework that introduces System 2-style thinking to multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-01-08T18:49:41Z)
- Qwen2.5 Technical Report [122.13958993185952]
We introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs.
Compared to previous iterations, Qwen2.5 has been significantly improved during both the pre-training and post-training stages.
Open-weight offerings include base and instruction-tuned models, with quantized versions available.
For hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus.
arXiv Detail & Related papers (2024-12-19T17:56:09Z)
- Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On [55.449818944278526]
We introduce the Skywork-Math model series, supervised fine-tuned (SFT) from common 7B language models.
Skywork-Math 7B achieves an impressive accuracy of 51.2% on the competition-level MATH benchmark.
We provide several practical takeaways to enhance math reasoning abilities in LLMs for both research and industry applications.
arXiv Detail & Related papers (2024-07-11T09:56:51Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct [130.37945867605302]
We present WizardMath, which enhances the mathematical CoT reasoning abilities of large language models (LLMs) without using external Python tools.
Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency.
Our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance.
arXiv Detail & Related papers (2023-08-18T14:23:21Z)