AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
- URL: http://arxiv.org/abs/2412.15084v2
- Date: Fri, 17 Jan 2025 07:12:55 GMT
- Title: AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling
- Authors: Zihan Liu, Yang Chen, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
- Abstract summary: AceMath is a suite of frontier math models that excel in solving complex math problems. We develop AceMath-72B-Instruct as the instruction-tuned math model and AceMath-72B-RM as the reward model. When combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce AceMath, a suite of frontier math models that excel in solving complex math problems, along with highly effective reward models capable of evaluating generated solutions and reliably identifying the correct ones. To develop the instruction-tuned math models, we propose a supervised fine-tuning (SFT) process that first achieves competitive performance across general domains, followed by targeted fine-tuning for the math domain using a carefully curated set of prompts and synthetically generated responses. The resulting model, AceMath-72B-Instruct, greatly outperforms Qwen2.5-Math-72B-Instruct, GPT-4o, and Claude-3.5 Sonnet. To develop a math-specialized reward model, we first construct AceMath-RewardBench, a comprehensive and robust benchmark for evaluating math reward models across diverse problems and difficulty levels. After that, we present a systematic approach to building our math reward models. The resulting model, AceMath-72B-RM, consistently outperforms state-of-the-art reward models. Furthermore, when combining AceMath-72B-Instruct with AceMath-72B-RM, we achieve the highest average rm@8 score across the math reasoning benchmarks. We release model weights, training data, and evaluation benchmarks at: https://research.nvidia.com/labs/adlr/acemath
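The rm@8 metric above (reward-model best-of-8 selection) is reported without its computation being spelled out. The following is a minimal sketch of how such a metric is typically computed; `generate_candidates`, `reward_score`, and `is_correct` are hypothetical placeholders for the policy model, the reward model, and an answer checker, not part of the AceMath release:

```python
# Minimal sketch of an rm@k-style metric: sample k candidate solutions,
# let the reward model pick the highest-scoring one, and count the
# problem as solved only if that picked solution is correct.
# All three helper callables are hypothetical stand-ins.
from typing import Callable, List


def rm_at_k(
    problems: List[str],
    generate_candidates: Callable[[str, int], List[str]],  # policy model
    reward_score: Callable[[str, str], float],             # reward model
    is_correct: Callable[[str, str], bool],                # answer checker
    k: int = 8,
) -> float:
    solved = 0
    for problem in problems:
        candidates = generate_candidates(problem, k)
        # Reward model ranks the k samples; keep the top-scoring one.
        best = max(candidates, key=lambda sol: reward_score(problem, sol))
        solved += is_correct(problem, best)
    return solved / len(problems)
```

Under this reading, rm@8 upper-bounds greedy decoding (pass@1 with one sample) whenever the reward model reliably ranks a correct sample above incorrect ones.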
Related papers
- MathClean: A Benchmark for Synthetic Mathematical Data Cleaning [33.34499387060138]
Synthetic math questions and answers can introduce inaccuracies, which may degrade both training data and web data.
In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models.
Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark.
arXiv Detail & Related papers (2025-02-26T11:17:50Z) - UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts [7.856746367263317]
This paper introduces the UTMath Benchmark, a robust evaluation framework designed to assess Large Language Models.
It comprises 1,053 cutting-edge problems spanning nine mathematical domains, with an average of 68 test cases per problem.
The best-performing model, o1-mini, solves only 32.57% of the problems, followed by o1-preview at 27.16% and GPT-4o at 26.93%.
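As a loose illustration of unit-test-based scoring in the spirit of UTMath (a sketch, not the actual UTMath harness; the candidate function and test cases below are invented):

```python
# Sketch of unit-test-based evaluation: a generated solution counts as
# passing only if it satisfies every test case for the problem.
# The candidate solution and tests below are invented examples.
from typing import Any, Callable, List, Tuple


def passes_all_tests(
    solution: Callable[..., Any],
    tests: List[Tuple[tuple, Any]],  # (args, expected) pairs
) -> bool:
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False  # runtime errors also count as failures
    return True


# Example: a candidate implementation of the n-th triangular number.
candidate = lambda n: n * (n + 1) // 2
tests = [((1,), 1), ((4,), 10), ((10,), 55)]
print(passes_all_tests(candidate, tests))  # True
```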
arXiv Detail & Related papers (2024-11-11T18:59:02Z) - DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning [24.68321102981711]
We introduce a series of large language models (LLMs) that employ the Decomposition of thought with code assistance and self-correction for mathematical reasoning, dubbed DotaMath.
DotaMath models tackle complex mathematical tasks by decomposing them into simpler logical subtasks, leveraging code to solve these subtasks, and engaging in self-reflection and correction.
We train a series of base LLMs using imitation learning on DotaMathQA, resulting in DotaMath models that achieve remarkable performance compared to open-source LLMs.
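As a rough sketch of the decompose / solve-with-code / self-correct loop described above (the three callable helpers are hypothetical placeholders standing in for LLM calls and a code executor, not the DotaMath implementation):

```python
# Sketch of a decompose / solve-with-code / self-correct loop in the
# spirit of DotaMath. All three helpers are hypothetical placeholders.
from typing import Callable, List


def solve_with_self_correction(
    problem: str,
    decompose: Callable[[str], List[str]],        # LLM: split into subtasks
    solve_subtask: Callable[[str], str],          # LLM + code executor
    check_and_revise: Callable[[str, str], str],  # LLM: reflect / correct
    max_rounds: int = 3,
) -> List[str]:
    results = []
    for subtask in decompose(problem):
        answer = solve_subtask(subtask)
        for _ in range(max_rounds):
            revised = check_and_revise(subtask, answer)
            if revised == answer:  # no correction proposed; accept it
                break
            answer = revised
        results.append(answer)
    return results
```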
arXiv Detail & Related papers (2024-07-04T17:39:16Z) - MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z) - InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs, InternLM-Math, which are continually pre-trained from InternLM2.
We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and a code interpreter in a seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z) - MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z) - MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models [91.66694225955872]
We propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning.
Specifically, we start by bootstrapping mathematical questions, rewriting each question from multiple perspectives without extra knowledge.
We release the MetaMathQA dataset, the MetaMath models of different sizes, and the training code for public use.
arXiv Detail & Related papers (2023-09-21T17:45:42Z) - MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning [60.208045804204076]
We introduce MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving.
The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset.
arXiv Detail & Related papers (2023-09-11T17:47:22Z)