Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
- URL: http://arxiv.org/abs/2312.17120v1
- Date: Thu, 28 Dec 2023 16:55:40 GMT
- Title: Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
- Authors: Zengzhi Wang, Rui Xia, Pengfei Liu
- Abstract summary: We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing, filtering, and deduplication steps.
We hope MathPile can help to enhance the mathematical reasoning abilities of language models.
- Score: 52.66190891388847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality, large-scale corpora are the cornerstone of building foundation
models. In this work, we introduce MathPile, a diverse and
high-quality math-centric corpus comprising about 9.5 billion tokens.
Throughout its creation, we adhered to the principle of "less is
more", firmly believing in the supremacy of data quality over quantity, even
in the pre-training phase. Our meticulous data collection and processing
efforts included a complex suite of preprocessing, prefiltering, language
identification, cleaning, filtering, and deduplication, ensuring the high
quality of our corpus. Furthermore, we performed data contamination detection
on downstream benchmark test sets to eliminate duplicates. We hope our
MathPile can help to enhance the mathematical reasoning abilities of
language models. We plan to open-source different versions of MathPile with
the scripts used for processing, to facilitate future developments in this
field.
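The abstract names the pipeline stages but not their implementation. As a rough, hypothetical sketch of the final two stages, exact deduplication and n-gram contamination detection against benchmark test sets (all names and thresholds here are invented for illustration, not the authors' code):

```python
import hashlib

def ngrams(text: str, n: int = 13):
    """Yield word-level n-grams; 13-grams are a common choice for contamination checks."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def dedup_exact(documents):
    """Drop documents whose normalized text hash has been seen before."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def remove_contaminated(documents, benchmark_texts, n: int = 13):
    """Drop any training document sharing an n-gram with a benchmark test set."""
    test_grams = {g for t in benchmark_texts for g in ngrams(t, n)}
    return [d for d in documents
            if not any(g in test_grams for g in ngrams(d, n))]
```

The actual MathPile pipeline likely uses fuzzy (e.g. MinHash-based) deduplication as well; this sketch shows only the exact-match variant.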
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a verifier enhanced with natural language feedback.
Our experiments reveal that a small set (30k) of natural language feedback examples can significantly boost the verifier's performance.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Laying Anchors: Semantically Priming Numerals in Language Modeling [11.831883526217942]
We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in that corpus.
We demonstrate significant improvements in the mathematical grounding of our learned embeddings.
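The summary above does not say how the anchors are generated; as a loose illustration of one plausible reading (quantiles of the corpus's numeral distribution; this is entirely an assumption, not the paper's method):

```python
import re
import statistics

def numeral_anchors(corpus_texts, k: int = 5):
    """Pick k anchor values from the empirical distribution of numerals in a corpus.
    Hypothetical illustration only; the paper's anchor-generation procedure may differ."""
    values = [float(m) for text in corpus_texts
              for m in re.findall(r"-?\d+(?:\.\d+)?", text)]
    # Evenly spaced quantiles of the observed numerals serve as anchors.
    return statistics.quantiles(values, n=k + 1)
```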
arXiv Detail & Related papers (2024-04-02T00:02:00Z) - Autonomous Data Selection with Language Models for Mathematical Texts [13.789739307267952]
We introduce a novel strategy that leverages base language models for autonomous data selection.
Our approach utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously.
Our method achieves a twofold increase in pretraining token efficiency compared to state-of-the-art baselines.
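A minimal sketch of the meta-prompted zero-shot verifier idea; `lm_client` is a hypothetical stand-in for any LM completion API, and the prompt is invented for illustration:

```python
SCORING_PROMPT = (
    "You are a strict reviewer of mathematical pretraining data.\n"
    "Does the following text contain substantive, well-formed mathematics?\n"
    "Answer YES or NO.\n\nText:\n{text}"
)

def select_math_documents(documents, lm_client):
    """Keep documents that a zero-shot LM verifier judges as high-quality math.
    lm_client.complete is a hypothetical stand-in, not a real library call."""
    selected = []
    for doc in documents:
        answer = lm_client.complete(SCORING_PROMPT.format(text=doc[:4000]))
        if answer.strip().upper().startswith("YES"):
            selected.append(doc)
    return selected
```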
arXiv Detail & Related papers (2024-02-12T13:09:21Z) - OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset inspired by these works, containing 14.7B tokens of mathematical webpages from Common Crawl.
We run small-scale experiments by training 1.4B-parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
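For readers who want to inspect the data, a short loading sketch; the Hugging Face dataset id and the `text` field are assumptions to verify, not claims from the paper:

```python
from itertools import islice
from datasets import load_dataset  # pip install datasets

# Stream OpenWebMath from the Hugging Face Hub. The dataset id
# "open-web-math/open-web-math" is assumed here; verify before relying on it.
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
for record in islice(ds, 3):
    print(record["text"][:200])  # each record is a crawled math webpage
```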
arXiv Detail & Related papers (2023-10-10T16:57:28Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem
Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM).
Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols, and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
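The basic-to-advanced curriculum can be pictured as ordering pretraining examples by a difficulty proxy. The sketch below uses formula density as that proxy; this heuristic is an assumption for illustration, not JiuZhang's actual course definition:

```python
def formula_density(example: str) -> float:
    """Crude difficulty proxy: fraction of tokens containing math-like symbols.
    A stand-in heuristic; the paper's basic/advanced split is defined differently."""
    tokens = example.split()
    mathy = sum(any(c in "=+-*/^_{}\\" for c in tok) for tok in tokens)
    return mathy / max(len(tokens), 1)

def curriculum_order(examples):
    """Order pretraining examples from 'basic' (few formulas) to 'advanced'."""
    return sorted(examples, key=formula_density)
```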
arXiv Detail & Related papers (2022-06-13T17:03:52Z) - Learning to Match Mathematical Statements with Proofs [37.38969121408295]
The task is designed to improve the processing of research-level mathematical texts.
We release a dataset for the task, consisting of over 180k statement-proof pairs.
We show that considering the assignment problem globally and using weighted bipartite matching algorithms substantially improves performance on the task.
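Treating statement-proof matching globally is the classic assignment problem. A minimal SciPy sketch, assuming a precomputed statement-by-proof score matrix (random here; producing real scores is the paper's contribution):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: estimated compatibility of statement i with proof j.
rng = np.random.default_rng(0)
scores = rng.random((4, 4))

# Maximize total score over a one-to-one global assignment.
rows, cols = linear_sum_assignment(scores, maximize=True)
for i, j in zip(rows, cols):
    print(f"statement {i} -> proof {j} (score {scores[i, j]:.2f})")
```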
arXiv Detail & Related papers (2021-02-03T15:38:54Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
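Dynamic Blocking discourages verbatim copying of the source during decoding. A simplified, hedged sketch of that idea (the original applies blocking probabilistically over sampled block sets):

```python
def dynamic_blocking(source_ids, generated_ids, logits, block_value=-1e9):
    """Sketch of the Dynamic Blocking idea: if the most recently generated token
    also occurs in the source, suppress the token that follows it in the source,
    nudging decoding away from verbatim copying. Simplified from the paper."""
    if generated_ids:
        last = generated_ids[-1]
        for i in range(len(source_ids) - 1):
            if source_ids[i] == last:
                logits[source_ids[i + 1]] = block_value  # forbid copying the next source token
    return logits
```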
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
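Contrastive pre-training of this kind typically pairs each program with a semantics-preserving transform of itself and applies an InfoNCE-style loss. A generic sketch (not ContraCode's exact training code):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.07):
    """InfoNCE over a batch: each anchor's positive is the embedding of a
    semantics-preserving transform of the same program; other rows in the
    batch act as negatives."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature       # cosine similarities
    targets = torch.arange(a.size(0))      # matching row = positive pair
    return F.cross_entropy(logits, targets)
```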
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences of its use.