Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
- URL: http://arxiv.org/abs/2210.12929v1
- Date: Mon, 24 Oct 2022 03:01:52 GMT
- Title: Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
- Authors: Vikas Raunak and Arul Menezes
- Abstract summary: Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT).
We propose a new, inexpensive algorithm for extractive memorization in constrained sequence generation tasks.
We develop a simple algorithm which elicits non-memorized translations of memorized samples from the same model.
- Score: 12.478605921259403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memorization presents a challenge for several constrained Natural Language
Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the
proclivity of neural models to memorize noisy and atypical samples reacts
adversely with the noisy (web crawled) datasets. However, previous studies of
memorization in constrained NLG tasks have only focused on counterfactual
memorization, linking it to the problem of hallucinations. In this work, we
propose a new, inexpensive algorithm for extractive memorization (exact
training data generation under insufficient context) in constrained sequence
generation tasks and use it to study extractive memorization and its effects in
NMT. We demonstrate that extractive memorization poses a serious threat to NMT
reliability by qualitatively and quantitatively characterizing the memorized
samples as well as the model behavior in their vicinity. Based on empirical
observations, we develop a simple algorithm which elicits non-memorized
translations of memorized samples from the same model, for a large fraction of
such samples. Finally, we show that the proposed algorithm could also be
leveraged to mitigate memorization in the model through finetuning. We have
released the code to reproduce our results at
https://github.com/vyraun/Finding-Memo.
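The abstract defines extractive memorization as exact training data generation under insufficient context. As a rough illustration only, and not the authors' exact procedure, the sketch below checks whether a proper prefix of a training source sentence already elicits the full training target; the `translate` callable, the `bitext` iterable, and the 0.75 prefix cap are hypothetical stand-ins.

```python
# Hedged sketch: flagging extractive memorization (exact training-target
# generation under insufficient source context) in an NMT model.
# `translate` is a hypothetical stand-in for the model's inference call;
# the prefix ratio and exact-match criterion are illustrative choices,
# not necessarily those used in the paper.

from typing import Callable, Iterable, List, Tuple

def find_extractive_memorization(
    translate: Callable[[str], str],
    bitext: Iterable[Tuple[str, str]],
    max_prefix_ratio: float = 0.75,
) -> List[Tuple[str, str, str]]:
    """Return (source, target, triggering_prefix) triples where a proper
    prefix of the source already elicits the exact training target."""
    flagged = []
    for source, target in bitext:
        tokens = source.split()
        n = len(tokens)
        # Only consider prefixes clearly shorter than the full source,
        # i.e. "insufficient context" for producing the full target.
        max_len = min(n - 1, max(1, int(n * max_prefix_ratio)))
        if max_len < 1:
            continue  # single-token sources have no proper prefix
        for k in range(1, max_len + 1):
            prefix = " ".join(tokens[:k])
            if translate(prefix).strip() == target.strip():
                flagged.append((source, target, prefix))
                break
    return flagged
```

For the actual detection and recovery algorithms, refer to the released code linked above.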
Related papers
- The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization [15.348047288817478]
We analyze the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling.
An increase of the nucleus size reduces memorization only modestly.
Even when models do not engage in "hard" memorization, they may still display "soft" memorization.
arXiv Detail & Related papers (2024-08-29T08:30:33Z)
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Reducing Training Sample Memorization in GANs by Training with Memorization Rejection [80.0916819303573]
We propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training (a minimal sketch appears after this list).
Our scheme is simple, generic and can be directly applied to any GAN architecture.
arXiv Detail & Related papers (2022-10-21T20:17:50Z)
- Measures of Information Reflect Memorization Patterns [53.71420125627608]
We show that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization.
Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabelled in-distribution examples.
arXiv Detail & Related papers (2022-10-17T20:15:24Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
- Encoding-based Memory Modules for Recurrent Neural Networks [79.42778415729475]
We study the memorization subtask from the point of view of the design and training of recurrent neural networks.
We propose a new model, the Linear Memory Network, which features an encoding-based memorization component built with a linear autoencoder for sequences.
arXiv Detail & Related papers (2020-01-31T11:14:27Z)
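As referenced in the memorization-rejection entry above, the following is a minimal sketch of the idea as summarized there: filter out generated samples whose nearest training sample is closer than a threshold. The distance measure, the threshold, and the use of flat NumPy arrays are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of "memorization rejection": during GAN training, drop
# generated samples that are near-duplicates of training samples.
# Euclidean distance in flattened sample space and the threshold are
# illustrative choices, not the paper's exact ones.

import numpy as np

def reject_near_duplicates(
    generated: np.ndarray,   # shape (B, D): a batch of generated samples
    train_data: np.ndarray,  # shape (N, D): (a subset of) the training set
    threshold: float,
) -> np.ndarray:
    """Keep only generated samples whose nearest training neighbour is
    farther than `threshold`; the rest are rejected as near-duplicates."""
    # Pairwise Euclidean distances between generated and training samples.
    diffs = generated[:, None, :] - train_data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # shape (B, N)
    nearest = dists.min(axis=1)                  # distance to closest training sample
    return generated[nearest > threshold]

# In a training loop, the filtered batch (rather than the raw generator
# output) would be used when computing the GAN updates.
```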
This list is automatically generated from the titles and abstracts of the papers on this site.