Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
- URL: http://arxiv.org/abs/2312.17120v1
- Date: Thu, 28 Dec 2023 16:55:40 GMT
- Title: Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale
Pretraining Corpus for Math
- Authors: Zengzhi Wang, Rui Xia, Pengfei Liu
- Abstract summary: We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing, filtering, and deduplication steps.
We hope MathPile can help to enhance the mathematical reasoning abilities of language models.
- Score: 52.66190891388847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality, large-scale corpora are the cornerstone of building foundation
models. In this work, we introduce MathPile, a diverse and
high-quality math-centric corpus comprising about 9.5 billion tokens.
Throughout its creation, we adhered to the principle of "less is
more", firmly believing in the supremacy of data quality over quantity, even
in the pre-training phase. Our meticulous data collection and processing
efforts included a complex suite of preprocessing, prefiltering, language
identification, cleaning, filtering, and deduplication, ensuring the high
quality of our corpus. Furthermore, we performed data contamination detection
on downstream benchmark test sets to eliminate duplicates. We hope our
MathPile can help to enhance the mathematical reasoning abilities of
language models. We plan to open-source different versions of MathPile with
the scripts used for processing, to facilitate future developments in this
field.
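The abstract names the pipeline stages but not their implementation. As a rough, hypothetical sketch of the final two stages, exact deduplication and n-gram contamination detection against benchmark test sets (all names and thresholds here are invented for illustration, not the authors' code):

```python
import hashlib

def ngrams(text: str, n: int = 13):
    """Yield word-level n-grams; 13-grams are a common choice for contamination checks."""
    words = text.split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def dedup_exact(documents):
    """Drop documents whose normalized text hash has been seen before."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def remove_contaminated(documents, benchmark_texts, n: int = 13):
    """Drop any training document sharing an n-gram with a benchmark test set."""
    test_grams = {g for t in benchmark_texts for g in ngrams(t, n)}
    return [d for d in documents
            if not any(g in test_grams for g in ngrams(d, n))]
```

The actual MathPile pipeline likely uses fuzzy (e.g. MinHash-based) deduplication as well; this sketch shows only the exact-match variant.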
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a verifier enhanced with natural language feedback.
Our experiments reveal that a small set (30k) of natural language feedback examples can significantly boost the verifier's performance.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Laying Anchors: Semantically Priming Numerals in Language Modeling [11.831883526217942]
We introduce strategies to semantically prime numerals in any corpus by generating anchors governed by the distribution of numerals in that corpus.
We demonstrate significant improvements in the mathematical grounding of our learned embeddings.
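The summary above does not say how the anchors are generated; as a loose illustration of one plausible reading (quantiles of the corpus's numeral distribution; this is entirely an assumption, not the paper's method):

```python
import re
import statistics

def numeral_anchors(corpus_texts, k: int = 5):
    """Pick k anchor values from the empirical distribution of numerals in a corpus.
    Hypothetical illustration only; the paper's anchor-generation procedure may differ."""
    values = [float(m) for text in corpus_texts
              for m in re.findall(r"-?\d+(?:\.\d+)?", text)]
    # Evenly spaced quantiles of the observed numerals serve as anchors.
    return statistics.quantiles(values, n=k + 1)
```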
arXiv Detail & Related papers (2024-04-02T00:02:00Z) - Autonomous Data Selection with Language Models for Mathematical Texts [13.789739307267952]
We introduce a novel strategy that leverages base language models for autonomous data selection.
Our approach utilizes meta-prompted language models as zero-shot verifiers to evaluate and select high-quality mathematical content autonomously.
Our method achieves a twofold increase in pretraining token efficiency compared to state-of-the-art baselines.
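A minimal sketch of the meta-prompted zero-shot verifier idea; `lm_client` is a hypothetical stand-in for any LM completion API, and the prompt is invented for illustration:

```python
SCORING_PROMPT = (
    "You are a strict reviewer of mathematical pretraining data.\n"
    "Does the following text contain substantive, well-formed mathematics?\n"
    "Answer YES or NO.\n\nText:\n{text}"
)

def select_math_documents(documents, lm_client):
    """Keep documents that a zero-shot LM verifier judges as high-quality math.
    lm_client.complete is a hypothetical stand-in, not a real library call."""
    selected = []
    for doc in documents:
        answer = lm_client.complete(SCORING_PROMPT.format(text=doc[:4000]))
        if answer.strip().upper().startswith("YES"):
            selected.append(doc)
    return selected
```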
arXiv Detail & Related papers (2024-02-12T13:09:21Z) - OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset inspired by these works, containing 14.7B tokens of mathematical webpages from Common Crawl.
We run small-scale experiments by training 1.4B-parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
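For readers who want to inspect the data, a short loading sketch; the Hugging Face dataset id and the `text` field are assumptions to verify, not claims from the paper:

```python
from itertools import islice
from datasets import load_dataset  # pip install datasets

# Stream OpenWebMath from the Hugging Face Hub. The dataset id
# "open-web-math/open-web-math" is assumed here; verify before relying on it.
ds = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
for record in islice(ds, 3):
    print(record["text"][:200])  # each record is a crawled math webpage
```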
arXiv Detail & Related papers (2023-10-10T16:57:28Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem
Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM).
Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols, and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
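The basic-to-advanced curriculum can be pictured as ordering pretraining examples by a difficulty proxy. The sketch below uses formula density as that proxy; this heuristic is an assumption for illustration, not JiuZhang's actual course definition:

```python
def formula_density(example: str) -> float:
    """Crude difficulty proxy: fraction of tokens containing math-like symbols.
    A stand-in heuristic; the paper's basic/advanced split is defined differently."""
    tokens = example.split()
    mathy = sum(any(c in "=+-*/^_{}\\" for c in tok) for tok in tokens)
    return mathy / max(len(tokens), 1)

def curriculum_order(examples):
    """Order pretraining examples from 'basic' (few formulas) to 'advanced'."""
    return sorted(examples, key=formula_density)
```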
arXiv Detail & Related papers (2022-06-13T17:03:52Z) - Learning to Match Mathematical Statements with Proofs [37.38969121408295]
The task is designed to improve the processing of research-level mathematical texts.
We release a dataset for the task, consisting of over 180k statement-proof pairs.
We show that considering the assignment problem globally and using weighted bipartite matching algorithms substantially improves performance on the task.
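Treating statement-proof matching globally is the classic assignment problem. A minimal SciPy sketch, assuming a precomputed statement-by-proof score matrix (random here; producing real scores is the paper's contribution):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i, j]: estimated compatibility of statement i with proof j.
rng = np.random.default_rng(0)
scores = rng.random((4, 4))

# Maximize total score over a one-to-one global assignment.
rows, cols = linear_sum_assignment(scores, maximize=True)
for i, j in zip(rows, cols):
    print(f"statement {i} -> proof {j} (score {scores[i, j]:.2f})")
```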
arXiv Detail & Related papers (2021-02-03T15:38:54Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
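Dynamic Blocking discourages verbatim copying of the source during decoding. A simplified, hedged sketch of that idea (the original applies blocking probabilistically over sampled block sets):

```python
def dynamic_blocking(source_ids, generated_ids, logits, block_value=-1e9):
    """Sketch of the Dynamic Blocking idea: if the most recently generated token
    also occurs in the source, suppress the token that follows it in the source,
    nudging decoding away from verbatim copying. Simplified from the paper."""
    if generated_ids:
        last = generated_ids[-1]
        for i in range(len(source_ids) - 1):
            if source_ids[i] == last:
                logits[source_ids[i + 1]] = block_value  # forbid copying the next source token
    return logits
```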
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
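Contrastive pre-training of this kind typically pairs each program with a semantics-preserving transform of itself and applies an InfoNCE-style loss. A generic sketch (not ContraCode's exact training code):

```python
import torch
import torch.nn.functional as F

def info_nce(anchor_emb, positive_emb, temperature=0.07):
    """InfoNCE over a batch: each anchor's positive is the embedding of a
    semantics-preserving transform of the same program; other rows in the
    batch act as negatives."""
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature       # cosine similarities
    targets = torch.arange(a.size(0))      # matching row = positive pair
    return F.cross_entropy(logits, targets)
```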
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy or quality of the information presented and is not responsible for any consequences of its use.