OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
- URL: http://arxiv.org/abs/2310.06786v1
- Date: Tue, 10 Oct 2023 16:57:28 GMT
- Title: OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text
- Authors: Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba
- Abstract summary: We introduce OpenWebMath, an open dataset containing 14.7B tokens of mathematical webpages from Common Crawl.
We run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
- Score: 32.15651290548974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There is growing evidence that pretraining on high quality, carefully
thought-out tokens such as code or mathematics plays an important role in
improving the reasoning abilities of large language models. For example,
Minerva, a PaLM model finetuned on billions of tokens of mathematical documents
from arXiv and the web, reported dramatically improved performance on problems
that require quantitative reasoning. However, because all known open source web
datasets employ preprocessing that does not faithfully preserve mathematical
notation, the benefits of large scale training on quantitative web documents are
unavailable to the research community. We introduce OpenWebMath, an open
dataset inspired by these works containing 14.7B tokens of mathematical
webpages from Common Crawl. We describe in detail our method for extracting
text and LaTeX content and removing boilerplate from HTML documents, as well as
our methods for quality filtering and deduplication. Additionally, we run
small-scale experiments by training 1.4B parameter language models on
OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass
the performance of models trained on over 20x the amount of general language
data. We hope that our dataset, openly released on the Hugging Face Hub, will
help spur advances in the reasoning abilities of large language models.
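As a rough illustration of the kind of pipeline the abstract describes (extracting text and LaTeX from HTML, then deduplicating), the sketch below is a minimal, hypothetical version. It is not the paper's actual method: the MathJax-style <script type="math/tex"> markup and exact SHA-256 deduplication are simplifying assumptions chosen for brevity, whereas the paper details its own extraction, quality filtering, and deduplication methods.

```python
# Minimal sketch, NOT the OpenWebMath pipeline: extract readable text plus
# inline LaTeX from an HTML page, then drop exact duplicate documents.
import hashlib
from bs4 import BeautifulSoup


def extract_text_and_latex(html: str) -> str:
    """Pull readable text out of an HTML page, keeping LaTeX inline."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: equations live in MathJax <script type="math/tex"> tags.
    # Replace each with $...$ so the LaTeX survives plain-text extraction.
    for tag in soup.find_all("script", attrs={"type": "math/tex"}):
        tag.replace_with(f"${tag.get_text()}$")
    # Strip common boilerplate containers before extracting text.
    for tag in soup.find_all(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)


def deduplicate(documents: list[str]) -> list[str]:
    """Remove exact duplicates by content hash; production pipelines typically
    use near-duplicate detection instead, which this sketch does not attempt."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```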
Related papers
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z) - Interactive Distillation of Large Single-Topic Corpora of Scientific
Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z) - Textbooks Are All You Need II: phi-1.5 technical report [55.6940110946465]
We create a new 1.3 billion parameter model named phi-1.5 with performance on natural language tasks comparable to models 5x larger.
phi-1.5 exhibits many of the traits of much larger Large Language Models.
We open-source phi-1.5 to promote further research on these urgent topics.
arXiv Detail & Related papers (2023-09-11T14:01:45Z) - WizardMath: Empowering Mathematical Reasoning for Large Language Models
via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math.
Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, and simultaneously surpasses Text-davinci, PaLM-1 and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z) - The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora
with Web Data, and Web Data Only [48.498376125522114]
We show that properly filtered and deduplicated web data alone can lead to powerful models.
We release an extract of 600 billion tokens from our RefinedWeb dataset, along with 1.3B and 7.5B parameter language models trained on it.
arXiv Detail & Related papers (2023-06-01T20:03:56Z) - Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z) - Self-Supervised Pretraining of Graph Neural Network for the Retrieval of
Related Mathematical Expressions in Scientific Articles [8.942112181408156]
We propose a new approach for retrieval of mathematical expressions based on machine learning.
We design an unsupervised representation learning task that combines embedding learning with self-supervised learning.
We collect a huge dataset with over 29 million mathematical expressions from over 900,000 publications on arXiv.org.
arXiv Detail & Related papers (2022-08-22T12:11:30Z) - Lessons from Deep Learning applied to Scholarly Information Extraction:
What Works, What Doesn't, and Future Directions [12.62863659147376]
We show how EneRex can extract key insights from a large-scale dataset in the domain of computer science.
We highlight how the existing datasets are limited in their capacity and how EneRex may fit into an existing knowledge graph.
arXiv Detail & Related papers (2022-07-08T17:37:56Z) - Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z)