Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
- URL: http://arxiv.org/abs/2407.16607v2
- Date: Wed, 24 Jul 2024 23:34:21 GMT
- Title: Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
- Authors: Jonathan Hayase, Alisa Liu, Yejin Choi, Sewoong Oh, Noah A. Smith
- Abstract summary: We tackle data mixture inference, a task that aims to uncover the distributional make-up of training data.
We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers.
We show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources.
- Score: 112.0422370149713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pretraining data of today's strongest language models is opaque; in particular, little is known about the proportions of various domains or languages represented. In this work, we tackle a task which we call data mixture inference, which aims to uncover the distributional make-up of training data. We introduce a novel attack based on a previously overlooked source of information -- byte-pair encoding (BPE) tokenizers, used by the vast majority of modern language models. Our key insight is that the ordered list of merge rules learned by a BPE tokenizer naturally reveals information about the token frequencies in its training data: the first merge is the most common byte pair, the second is the most common pair after merging the first token, and so on. Given a tokenizer's merge list along with data samples for each category of interest, we formulate a linear program that solves for the proportion of each category in the tokenizer's training set. Importantly, to the extent to which tokenizer training data is representative of the pretraining data, we indirectly learn about pretraining data. In controlled experiments, we show that our attack recovers mixture ratios with high precision for tokenizers trained on known mixtures of natural languages, programming languages, and data sources. We then apply our approach to off-the-shelf tokenizers released with recent LMs. We confirm much publicly disclosed information about these models, and also make several new inferences: GPT-4o's tokenizer is much more multilingual than its predecessors, training on 39% non-English data; Llama3 extends GPT-3.5's tokenizer primarily for multilingual (48%) use; GPT-3.5's and Claude's tokenizers are trained on predominantly code (~60%). We hope our work sheds light on current design practices for pretraining data, and inspires continued research into data mixture inference for LMs.
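The constraint structure described in the abstract can be made concrete with a small example. Below is a minimal sketch, not the authors' released implementation: it assumes toy per-category pair counts, two observed merges, and a single non-negative slack variable per merge step, and solves for the mixture proportions with scipy's linprog.

```python
"""
Minimal sketch of the linear-program formulation described in the abstract.
Assumptions (not from the paper): the per-category pair counts below are toy
numbers, and violations are absorbed by one slack variable per merge step.
Requires scipy (`pip install scipy`).
"""
import numpy as np
from scipy.optimize import linprog

# Toy setup: 2 categories and the tokenizer's first two observed merges.
# pair_counts[t][c][p] = count of byte pair p in category c's sample after
# applying merges 1..t (hard-coded here for illustration).
categories = ["english", "code"]
merges = [("t", "h"), ("i", "n")]  # observed merge list (toy)
pair_counts = [
    # counts before any merge is applied
    {"english": {("t", "h"): 90, ("i", "n"): 60, ("e", "r"): 50},
     "code":    {("t", "h"): 10, ("i", "n"): 70, ("e", "r"): 40}},
    # counts after the merge ("t", "h") has been applied to both samples
    {"english": {("i", "n"): 60, ("e", "r"): 50},
     "code":    {("i", "n"): 70, ("e", "r"): 40}},
]

# Variables: alpha_c per category, plus one slack v_t per merge step.
# Objective: minimise total slack, i.e. total violation of the constraint
# "the chosen pair was the most frequent pair in the mixture at step t".
n_cat, n_steps = len(categories), len(merges)
c = np.concatenate([np.zeros(n_cat), np.ones(n_steps)])

A_ub, b_ub = [], []
for t, chosen in enumerate(merges):
    counts = pair_counts[t]
    for rival in {p for cat in counts.values() for p in cat}:
        if rival == chosen:
            continue
        # sum_c alpha_c * (count(rival) - count(chosen)) - v_t <= 0
        row = np.zeros(n_cat + n_steps)
        for i, cat in enumerate(categories):
            row[i] = counts[cat].get(rival, 0) - counts[cat].get(chosen, 0)
        row[n_cat + t] = -1.0
        A_ub.append(row)
        b_ub.append(0.0)

# Mixture proportions sum to 1; all variables non-negative.
A_eq = [np.concatenate([np.ones(n_cat), np.zeros(n_steps)])]
b_eq = [1.0]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (n_cat + n_steps))
print(dict(zip(categories, res.x[:n_cat])))  # estimated mixture proportions
```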
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
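As a rough illustration of the patch-level stage described above, here is a minimal sketch, not the paper's implementation: it assumes patches are formed by averaging K consecutive token embeddings and that the loss asks the model to predict every token of the following patch; a GRU stands in for the transformer backbone.

```python
"""
Minimal sketch of patch-level training (assumed construction, not the paper's).
Requires torch.
"""
import torch
import torch.nn as nn

K = 4  # tokens per patch (assumed)

class TinyPatchLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a transformer
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward_patches(self, token_ids):
        # token_ids: (batch, seq_len), seq_len divisible by K.
        b, t = token_ids.shape
        patches = self.embed(token_ids).view(b, t // K, K, -1).mean(dim=2)  # average tokens into patches
        hidden, _ = self.backbone(patches)
        return self.lm_head(hidden)  # one set of logits per patch position

def patch_level_loss(model, token_ids):
    # Patch i predicts every token of patch i+1 (same logits reused K times).
    logits = model.forward_patches(token_ids)
    b, n_patches, vocab = logits.shape
    targets = token_ids.view(b, n_patches, K)
    pred = logits[:, :-1].unsqueeze(2).expand(-1, -1, K, -1)
    return nn.functional.cross_entropy(
        pred.reshape(-1, vocab), targets[:, 1:].reshape(-1))

# Toy usage: after this patch-level stage, one would switch to ordinary
# token-level next-token training on the remaining data.
model = TinyPatchLM()
batch = torch.randint(0, 1000, (2, 32))
print(patch_level_loss(model, batch).item())
```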
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models that we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z)
- Pre-trained Language Model with Prompts for Temporal Knowledge Graph Completion [30.50032335014021]
We propose a novel TKGC model, namely Pre-trained Language Model with Prompts for TKGC (PPT).
We convert a series of sampled quadruples into pre-trained language model inputs and convert intervals between timestamps into different prompts to make coherent sentences with implicit semantic information.
Our model can effectively incorporate information from temporal knowledge graphs into the language models.
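A minimal sketch of that quadruple-to-prompt conversion is shown below; the sentence templates and time-interval buckets are invented for illustration and may differ from the paper's actual prompts.

```python
"""
Minimal sketch of converting temporal KG quadruples into prompt sentences.
Templates and interval buckets are illustrative assumptions, not the paper's.
"""
def interval_prompt(years_between: int) -> str:
    # Map the gap between consecutive timestamps to a time-interval prompt (assumed buckets).
    if years_between == 0:
        return "At the same time,"
    if years_between <= 1:
        return "Shortly after that,"
    return f"{years_between} years later,"

def quadruples_to_text(quads):
    # quads: list of (subject, relation, object, year), sorted by year.
    parts, prev_year = [], None
    for subj, rel, obj, year in quads:
        prefix = f"In {year}," if prev_year is None else interval_prompt(year - prev_year)
        parts.append(f"{prefix} {subj} {rel} {obj}.")
        prev_year = year
    return " ".join(parts)

quads = [("Barack Obama", "was elected president of", "the United States", 2008),
         ("Barack Obama", "visited", "Canada", 2009)]
print(quadruples_to_text(quads))
# -> "In 2008, Barack Obama was elected president of the United States. Shortly after that, Barack Obama visited Canada."
```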
arXiv Detail & Related papers (2023-05-13T12:53:11Z)
- A Compact Pretraining Approach for Neural Language Models [21.767174489837828]
We show that pretrained NLMs learn in-domain information more effectively and faster from a compact subset of the data.
We construct these compact subsets from the unstructured data using a combination of abstractive summaries and extractive keywords.
Our strategy reduces pretraining time by up to five times compared to vanilla pretraining.
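A minimal sketch of building such a compact subset is given below, assuming a Hugging Face summarization pipeline and a naive frequency-based keyword extractor; both choices are illustrative, not the paper's.

```python
"""
Minimal sketch of building a compact pretraining subset from raw documents.
The summariser checkpoint and the frequency-based keyword extractor are
illustrative assumptions. Requires transformers (`pip install transformers`).
"""
import re
from collections import Counter
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def extractive_keywords(text: str, k: int = 10) -> list[str]:
    # Naive stand-in for a keyword extractor: most frequent non-trivial word types.
    words = [w.lower() for w in re.findall(r"[A-Za-z]{4,}", text)]
    return [w for w, _ in Counter(words).most_common(k)]

def compact_example(document: str) -> str:
    # Pair an abstractive summary with extractive keywords to form one compact
    # example; the compact subset is this applied over the whole corpus.
    summary = summarizer(document, max_length=60, min_length=10,
                         truncation=True)[0]["summary_text"]
    return summary + "\nKeywords: " + ", ".join(extractive_keywords(document))
```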
arXiv Detail & Related papers (2022-08-25T22:43:47Z)
- LP-BERT: Multi-task Pre-training Knowledge Graph BERT for Link Prediction [3.5382535469099436]
LP-BERT contains two training stages: multi-task pre-training and knowledge graph fine-tuning.
We achieve state-of-the-art results on the WN18RR and UMLS datasets, with the Hits@10 metric in particular improving by 5%.
arXiv Detail & Related papers (2022-01-13T09:18:30Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding [69.40915115518523]
Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages.
Various data augmentation approaches have been proposed to synthesize training data in low-resource target languages.
In this paper we focus on mitigating noise in augmented data.
arXiv Detail & Related papers (2021-09-03T15:44:15Z)
- Ranking Creative Language Characteristics in Small Data Scenarios [52.00161818003478]
We adapt the DirectRanker to provide a new deep model for ranking creative language with small data.
Our experiments with sparse training data show that while the performance of standard neural ranking approaches collapses with small datasets, DirectRanker remains effective.
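For illustration, here is a minimal pairwise ranker in the spirit of DirectRanker, with a shared feature extractor and an antisymmetric, bias-free output; the exact architecture in the paper may differ.

```python
"""
Minimal sketch of a DirectRanker-style pairwise ranking model (illustrative
architecture, not the paper's exact one). Requires torch.
"""
import torch
import torch.nn as nn

class PairwiseRanker(nn.Module):
    def __init__(self, in_dim=32, hidden=16):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # No bias and an odd activation keep the output antisymmetric in the pair.
        self.score = nn.Linear(hidden, 1, bias=False)

    def forward(self, x1, x2):
        # Positive output means x1 ranks above x2; swapping the pair flips the sign.
        return torch.tanh(self.score(self.features(x1) - self.features(x2)))

ranker = PairwiseRanker()
a, b = torch.randn(4, 32), torch.randn(4, 32)
print(torch.allclose(ranker(a, b), -ranker(b, a)))  # True: antisymmetric by construction
```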
arXiv Detail & Related papers (2020-10-23T18:57:47Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To keep training on the enlarged dataset tractable, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.