Related papers: Zyda: A 1.3T Dataset for Open Language Modeling

Zyda: A 1.3T Dataset for Open Language Modeling

URL: http://arxiv.org/abs/2406.01981v1
Date: Tue, 4 Jun 2024 05:47:17 GMT
Title: Zyda: A 1.3T Dataset for Open Language Modeling
Authors: Yury Tokpanov, Beren Millidge, Paolo Glorioso, Jonathan Pilault, Adam Ibrahim, James Whittington, Quentin Anthony,
Abstract summary: Zyda is a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite.
Score: 10.973515151563427
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The size of large language models (LLMs) has scaled dramatically in recent years and their computational and data requirements have surged correspondingly. State-of-the-art language models, even at relatively smaller sizes, typically require training on at least a trillion tokens. This rapid advancement has eclipsed the growth of open-source datasets available for large-scale LLM pretraining. In this paper, we introduce Zyda (Zyphra Dataset), a dataset under a permissive license comprising 1.3 trillion tokens, assembled by integrating several major respected open-source datasets into a single, high-quality corpus. We apply rigorous filtering and deduplication processes, both within and across datasets, to maintain and enhance the quality derived from the original datasets. Our evaluations show that Zyda not only competes favorably with other open datasets like Dolma, FineWeb, and RefinedWeb, but also substantially improves the performance of comparable models from the Pythia suite. Our rigorous data processing methods significantly enhance Zyda's effectiveness, outperforming even the best of its constituent datasets when used independently.

Related papers

Distilling Dataset into Neural Field [12.551430414723086]
This paper proposes a novel parameterization framework for dataset distillation, coined Distilling dataset into Neural Field (DDiF) Due to the unique nature of the neural field, DDiF effectively preserves the information and easily generates various shapes of data. We demonstrate that DDiF achieves superior performance on several benchmark datasets, extending beyond the image domain to include video, audio, and 3D voxel.
arXiv Detail & Related papers (2025-03-05T14:33:29Z)
RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models. These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis. We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
SSE: Multimodal Semantic Data Selection and Enrichment for Industrial-scale Data Assimilation [29.454948190814765]
In recent years, the data collected for artificial intelligence has grown to an unmanageable amount. We propose a framework to select the most semantically diverse and important dataset portion. We further semantically enrich it by discovering meaningful new data from a massive unlabeled data pool.
arXiv Detail & Related papers (2024-09-20T19:17:52Z)
Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress. LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset. Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
GeMQuAD : Generating Multilingual Question Answering Datasets from Large Language Models using Few Shot Learning [4.8838210812204235]
In this paper, we propose GeMQuAD - a semi-supervised learning approach, applied to a dataset generated through ICL with just one example in the target language. We iteratively identify high-quality data to enhance model performance, especially for low-resource multilingual setting. Our framework outperforms the machine translation-augmented model by 0.22/1.68 F1/EM points for Hindi and 0.82/1.37 F1/EM points for Spanish on the MLQA dataset.
arXiv Detail & Related papers (2024-04-14T06:55:42Z)
Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small. We propose a novel method that augments training data by incorporating a wealth of examples from other datasets. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data. We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap. Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
Farzi Data: Autoregressive Data Distillation [34.39112473620335]
We study data distillation for auto-regressive machine learning tasks. We propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences.
arXiv Detail & Related papers (2023-10-15T23:23:27Z)
Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP [43.7219097444333]
We introduce a testbed of six publicly available data sources to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts. We find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source.
arXiv Detail & Related papers (2022-08-10T18:24:23Z)
Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation [101.26235068460551]
Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. Models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains. We introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner.
arXiv Detail & Related papers (2020-10-24T08:36:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.