Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
- URL: http://arxiv.org/abs/2407.16607v3
- Date: Thu, 5 Sep 2024 16:39:44 GMT
- Title: Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
- Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
- Abstract summary: We tackle a task we call data mixture inference, which aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
- Score: 112.0422370149713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information: byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data. Given a tokenizer's merge list along with example data for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's and Mistral NeMo's tokenizers are much more multilingual than their predecessors, training on 39% and 47% non-English language data, respectively; Llama 3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained predominantly on code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
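To make the linear-programming formulation concrete, here is a minimal sketch of the merge-order constraints: at each recorded merge step, the pair that was actually merged must have had at least as high a mixture-weighted count as every competing pair, with slack variables absorbing noise. All names (solve_mixture, c_win, c_alt) and the use of scipy's linprog are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the data mixture inference LP (assumption: pair counts
# were already measured by replaying the tokenizer's merges on a sample
# corpus for each candidate category; names are illustrative).
import numpy as np
from scipy.optimize import linprog

def solve_mixture(c_win, c_alt):
    """Estimate mixture proportions alpha from BPE merge-order constraints.

    c_win[t, k]    : count of the pair actually merged at step t in category k
    c_alt[t, j, k] : count of the j-th competing pair at step t in category k

    BPE merges the most frequent pair, so under the true mixture alpha:
        sum_k alpha[k] * (c_win[t, k] - c_alt[t, j, k]) >= 0  for all t, j.
    Noisy counts can make this infeasible, so each step t gets a slack
    variable v[t] >= 0 and we minimize the total slack.
    """
    T, K = c_win.shape
    _, J, _ = c_alt.shape
    n = K + T  # variables: [alpha_1 .. alpha_K, v_1 .. v_T]

    A_ub, b_ub = [], []
    for t in range(T):
        for j in range(J):
            row = np.zeros(n)
            row[:K] = -(c_win[t] - c_alt[t, j])  # -(margin) . alpha
            row[K + t] = -1.0                    # ... - v_t <= 0
            A_ub.append(row)
            b_ub.append(0.0)

    A_eq = np.zeros((1, n))
    A_eq[0, :K] = 1.0                            # mixture weights sum to 1

    c = np.concatenate([np.zeros(K), np.ones(T)])  # minimize total slack
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=A_eq, b_eq=[1.0], bounds=[(0, None)] * n)
    return res.x[:K]  # estimated proportion of each category
```

Note that the set of competing pairs at step t depends on all earlier merges, so in practice the counts must be recomputed as the merge list is replayed; the sketch takes those counts as given.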
Related papers
- Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language [34.54405113575568]
Machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual models.
We show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data.
We release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.
arXiv Detail & Related papers (2024-10-31T14:09:50Z)
- MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression [5.5795785998430185]
MultiTok is a new tokenizing tool inspired by universal Lempel-Ziv-Welch data compression.
We show that MultiTok achieves performance comparable to the standard BERT tokenizer.
arXiv Detail & Related papers (2024-10-28T21:24:51Z)
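For background on the compression idea this entry refers to, here is plain textbook LZW over characters; it illustrates the inspiration named in the title, not MultiTok's actual algorithm.

```python
# Minimal LZW-style tokenization sketch: the dictionary grows greedily,
# so frequent substrings become single variable-length tokens.
def lzw_tokenize(text: str):
    table = {chr(i): i for i in range(256)}  # seed with single characters
    tokens, current = [], ""
    for ch in text:
        if current + ch in table:
            current += ch                     # extend the longest known match
        else:
            tokens.append(table[current])     # emit token for the match
            table[current + ch] = len(table)  # learn a new, longer entry
            current = ch
    if current:
        tokens.append(table[current])
    return tokens

print(lzw_tokenize("abababab"))  # repeated patterns compress into few tokens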
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
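A toy sketch of the patch idea: average every K token embeddings into one patch vector and train the model to predict the next patch. The patch size, mean-pooling, GRU backbone, and regression loss here are all simplifying assumptions for illustration; the paper's actual formulation may differ.

```python
import torch
import torch.nn as nn

class PatchLevelLM(nn.Module):
    """Toy next-patch predictor: every K token embeddings are averaged
    into one patch vector, and the model regresses the next patch.
    (Illustrative only; not the paper's architecture.)"""
    def __init__(self, vocab_size=1000, d_model=64, patch_size=4):
        super().__init__()
        self.patch_size = patch_size
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, d_model)

    def forward(self, token_ids):  # (B, T) with T divisible by patch_size
        B, T = token_ids.shape
        K = self.patch_size
        x = self.embed(token_ids)                       # (B, T, D)
        patches = x.view(B, T // K, K, -1).mean(dim=2)  # (B, T/K, D)
        h, _ = self.backbone(patches[:, :-1])           # causal over patches
        pred = self.head(h)                             # predicted next patch
        target = patches[:, 1:].detach()
        return nn.functional.mse_loss(pred, target)

loss = PatchLevelLM()(torch.randint(0, 1000, (2, 32)))
```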
- Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion [30.50032335014021]
We propose a novel TKGC model, namely Pre-trained Language Model with Prompts for TKGC (PPT).
We convert a series of sampled quadruples into pre-trained language model inputs and convert intervals between timestamps into different prompts to make coherent sentences with implicit semantic information.
Our model can effectively incorporate information from temporal knowledge graphs into language models.
arXiv Detail & Related papers (2023-05-13T12:53:11Z)
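A rough illustration of the quadruple-to-sentence conversion, with timestamp intervals rendered as connective prompts; the templates and wording are invented for illustration, not PPT's actual prompts.

```python
def quadruples_to_text(quads):
    """Render chronologically sorted (subject, relation, object, year)
    quadruples as one coherent passage, expressing timestamp gaps as
    connective prompts (hypothetical templates)."""
    parts, prev_year = [], None
    for subj, rel, obj, year in sorted(quads, key=lambda q: q[3]):
        if prev_year is None:
            prefix = f"In {year},"
        elif year == prev_year:
            prefix = "In the same year,"
        else:
            prefix = f"{year - prev_year} years later,"
        parts.append(f"{prefix} {subj} {rel} {obj}.")
        prev_year = year
    return " ".join(parts)

print(quadruples_to_text([
    ("Barack Obama", "visited", "France", 2014),
    ("Barack Obama", "visited", "Japan", 2016),
]))
```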
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder- and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we explore how mixing programming and natural languages and training for multiple epochs affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
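For reference, a prefix-LM that unifies encoder- and decoder-style attention can be expressed as a single attention mask: bidirectional over the prefix, causal over the continuation. A minimal sketch (not CodeGen2's implementation):

```python
import numpy as np

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean attention mask: entry (i, j) is True if position j is
    visible to position i. Prefix tokens attend bidirectionally
    (encoder-like); remaining tokens attend causally (decoder-like)."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal base
    mask[:, :prefix_len] = True  # every position sees the whole prefix
    return mask

print(prefix_lm_mask(5, 2).astype(int))
```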
- PEACH: Pre-Training Sequence-to-Sequence Multilingual Models for Translation with Semi-Supervised Pseudo-Parallel Document Generation [5.004814662623874]
This paper introduces a novel semi-supervised method, SPDG, that generates high-quality pseudo-parallel data for multilingual pre-training.
Our experiments show that PEACH outperforms existing approaches used in training mT5 and mBART on various translation tasks.
arXiv Detail & Related papers (2023-04-03T18:19:26Z)
- A Compact Pretraining Approach for Neural Language Models [21.767174489837828]
We show that pretrained NLMs learn in-domain information more effectively and faster from a compact subset of the data.
We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords.
Our strategy reduces pretraining time by up to five times compared to vanilla pretraining.
arXiv Detail & Related papers (2022-08-25T22:43:47Z)
- Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z)
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
- Pre-training Multilingual Neural Machine Translation by Leveraging Alignment Information [72.2412707779571]
mRASP is an approach to pre-train a universal multilingual neural machine translation model.
We carry out experiments on 42 translation directions across diverse settings, including low-, medium-, and rich-resource languages, as well as transfer to exotic language pairs.
arXiv Detail & Related papers (2020-10-07T03:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.