MegaMath: Pushing the Limits of Open Math Corpora
- URL: http://arxiv.org/abs/2504.02807v1
- Date: Thu, 03 Apr 2025 17:52:07 GMT
- Title: MegaMath: Pushing the Limits of Open Math Corpora
- Authors: Fan Zhou, Zengzhi Wang, Nikhil Ranjan, Zhoujun Cheng, Liping Tang, Guowei He, Zhengzhong Liu, Eric P. Xing
- Abstract summary: We present MegaMath, an open dataset curated from diverse, math-focused sources. MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.
- Score: 44.148011362359036
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl with math-oriented HTML optimizations, fastText-based filtering, and deduplication, all to acquire higher-quality data from the Internet. (2) Recalling math-related code data: We identified high-quality math-related code from a large code training corpus, Stack-V2, further enhancing data diversity. (3) Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from web data or code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens with the largest quantity and top quality among existing open math pre-training datasets.
Related papers
- MIND: Math Informed syNthetic Dialogues for Pretraining LLMs [34.498175178707065]
We propose a novel large-scale and diverse Math Informed syNthetic Dialogue (MIND) generation method.
MIND generates synthetic conversations based on OpenWebMath (OWM), resulting in a new math corpus, MIND-OWM.
Our experiments with different conversational settings reveal that incorporating knowledge gaps between dialog participants is essential for generating high-quality math data.
arXiv Detail & Related papers (2024-10-15T18:25:53Z)
- MathCoder2: Better Math Reasoning from Continued Pretraining on Model-translated Mathematical Code [38.127313175508746]
We introduce a novel method for generating mathematical code accompanied with corresponding reasoning steps for continued pretraining.
Our approach begins with the construction of a high-quality mathematical continued pretraining dataset.
Appending the generated code to each reasoning step results in data consisting of paired natural language reasoning steps and their corresponding code.
arXiv Detail & Related papers (2024-10-10T17:58:40Z)
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
- MATHSENSEI: A Tool-Augmented Large Language Model for Mathematical Reasoning [2.9104279358536647]
We present MathSensei, a tool-augmented large language model for mathematical reasoning.
We study the complementary benefits of the tools - knowledge retriever (Bing Web Search), program generator + executor (Python), and symbolic equation solver (Wolfram-Alpha API).
arXiv Detail & Related papers (2024-02-27T05:50:35Z)
- InternLM-Math: Open Math Large Language Models Toward Verifiable Reasoning [98.53491178426492]
We open-source our math reasoning LLMs, InternLM-Math, which are continually pre-trained from InternLM2.
We unify chain-of-thought reasoning, reward modeling, formal reasoning, data augmentation, and code interpreter in a unified seq2seq format.
Our pre-trained model achieves 30.3 on the MiniF2F test set without fine-tuning.
arXiv Detail & Related papers (2024-02-09T11:22:08Z)
- MathPile: A Billion-Token-Scale Pretraining Corpus for Math [45.163340937419214]
We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing.
We aim for our MathPile to boost language models' mathematical reasoning abilities and open-source its different versions and processing scripts to advance the field.
arXiv Detail & Related papers (2023-12-28T16:55:40Z)
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset containing 14.7B tokens of webpages from Common Crawl.
We run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
arXiv Detail & Related papers (2023-10-10T16:57:28Z)
- MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning [54.2093509928664]
In math reasoning with large language models, fine-tuning data augmentation by query evolution and diverse reasoning paths has been empirically verified to be effective.
We investigate such data augmentation for math reasoning.
We release our code and augmented data at https://github.com/OFA-Sys/8k-Scel.
arXiv Detail & Related papers (2023-10-09T08:18:58Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.