How BPE Affects Memorization in Transformers
- URL: http://arxiv.org/abs/2110.02782v1
- Date: Wed, 6 Oct 2021 14:01:56 GMT
- Authors: Eugene Kharitonov and Marco Baroni and Dieuwke Hupkes
- Abstract summary: We show that the size of the subword vocabulary learned by Byte-Pair Encoding (BPE) greatly affects both the ability and the tendency of standard Transformer models to memorize training data.
We conjecture this effect is caused by reduction in the sequences' length that happens as the BPE vocabulary grows.
- Score: 36.53583838619203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training data memorization in NLP can both be beneficial (e.g., closed-book
QA) and undesirable (personal data extraction). In any case, successful model
training requires a non-trivial amount of memorization to store word spellings,
various linguistic idiosyncrasies and common knowledge. However, little is
known about what affects the memorization behavior of NLP models, as the field
tends to focus on the equally important question of generalization. In this
work, we demonstrate that the size of the subword vocabulary learned by
Byte-Pair Encoding (BPE) greatly affects both ability and tendency of standard
Transformer models to memorize training data, even when we control for the
number of learned parameters. We find that with a large subword vocabulary
size, Transformer models fit random mappings more easily and are more
vulnerable to membership inference attacks. Similarly, given a prompt,
Transformer-based language models with large subword vocabularies reproduce the
training data more often. We conjecture this effect is caused by reduction in
the sequences' length that happens as the BPE vocabulary grows. Our findings
can allow a more informed choice of hyper-parameters, that is better tailored
for a particular use-case.
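The conjectured mechanism, that a larger BPE vocabulary yields shorter token sequences, can be illustrated with a minimal sketch. This is a toy BPE trainer written for illustration only, not the authors' code or experimental setup: the corpus, the function names (`learn_bpe`, `apply_merge`, `encode`, `total_tokens`), and the merge counts are all assumptions, and end-of-word markers are omitted for brevity.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly merging the most frequent adjacent pair."""
    # Each word starts as a tuple of single characters (toy setup, no end-of-word marker).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


def total_tokens(words, num_merges):
    """Total number of tokens needed to encode `words` with a vocabulary of `num_merges` merges."""
    merges = learn_bpe(words, num_merges)
    return sum(len(encode(w, merges)) for w in words)


# Hypothetical toy corpus; more merges correspond to a larger subword vocabulary.
words = "low low low lower lower newest newest newest widest".split()
lengths = [total_tokens(words, k) for k in (0, 5, 10)]
print(lengths)
```

With zero merges the corpus is encoded character by character, and each additional merge can only keep the total token count the same or reduce it, which is the length reduction the abstract refers to.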
Related papers
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- Rethinking LLM Memorization through the Lens of Adversarial Compression [93.13830893086681]
Large language models (LLMs) trained on web-scale datasets raise substantial concerns regarding permissible data usage.
One major question is whether these models "memorize" all their training data or they integrate many data sources in some way more akin to how a human would learn and synthesize information.
We propose the Adversarial Compression Ratio (ACR) as a metric for assessing memorization in LLMs.
arXiv Detail & Related papers (2024-04-23T15:49:37Z)
- Memory Augmented Lookup Dictionary based Language Modeling for Automatic Speech Recognition [20.926163659469587]
We propose a new memory augmented lookup dictionary based Transformer architecture for LM.
The newly introduced lookup dictionary incorporates rich contextual information in training set, which is vital to correctly predict long-tail tokens.
Our proposed method is proved to outperform the baseline Transformer LM by a great margin on both word/character error rate and tail tokens error rate.
arXiv Detail & Related papers (2022-12-30T22:26:57Z)
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models [64.22311189896888]
We study exact memorization in causal and masked language modeling, across model sizes and throughout the training process.
Surprisingly, we show that larger models can memorize a larger portion of the data before over-fitting and tend to forget less throughout the training process.
arXiv Detail & Related papers (2022-05-22T07:43:50Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
- Counterfactual Memorization in Neural Language Models [91.8747020391287]
Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data.
An open question in previous studies of language model memorization is how to filter out "common" memorization.
We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training.
arXiv Detail & Related papers (2021-12-24T04:20:57Z)
- Deep Transformer based Data Augmentation with Subword Units for Morphologically Rich Online ASR [0.0]
Deep Transformer models have proven to be particularly powerful in language modeling tasks for ASR.
Recent studies showed that a considerable part of the knowledge of neural network Language Models (LM) can be transferred to traditional n-grams by using neural text generation based data augmentation.
We show that although data augmentation with Transformer-generated text works well for isolating languages, it causes a vocabulary explosion in a morphologically rich language.
We propose a new method called subword-based neural text augmentation, where we retokenize the generated text into statistically derived subwords.
arXiv Detail & Related papers (2020-07-14T10:22:05Z)
- Do sequence-to-sequence VAEs learn global features of sentences? [13.43800646539014]
We study the Variational Autoencoder (VAE) for natural language with the sequence-to-sequence architecture.
We find that VAEs are prone to memorizing the first words and the sentence length, producing local features of limited usefulness.
These variants learn latent variables that are more global, i.e., more predictive of topic or sentiment labels.
arXiv Detail & Related papers (2020-04-16T14:43:27Z)