Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
- URL: http://arxiv.org/abs/2210.12929v1
- Date: Mon, 24 Oct 2022 03:01:52 GMT
- Title: Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks
- Authors: Vikas Raunak and Arul Menezes
- Abstract summary: Memorization presents a challenge for several constrained Natural Language Generation (NLG) tasks such as Neural Machine Translation (NMT).
We propose a new, inexpensive algorithm for extractive memorization in constrained sequence generation tasks.
We develop a simple algorithm which elicits non-memorized translations of memorized samples from the same model.
- Score: 12.478605921259403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Memorization presents a challenge for several constrained Natural Language
Generation (NLG) tasks such as Neural Machine Translation (NMT), wherein the
proclivity of neural models to memorize noisy and atypical samples reacts
adversely with the noisy (web crawled) datasets. However, previous studies of
memorization in constrained NLG tasks have only focused on counterfactual
memorization, linking it to the problem of hallucinations. In this work, we
propose a new, inexpensive algorithm for extractive memorization (exact
training data generation under insufficient context) in constrained sequence
generation tasks and use it to study extractive memorization and its effects in
NMT. We demonstrate that extractive memorization poses a serious threat to NMT
reliability by qualitatively and quantitatively characterizing the memorized
samples as well as the model behavior in their vicinity. Based on empirical
observations, we develop a simple algorithm which elicits non-memorized
translations of memorized samples from the same model, for a large fraction of
such samples. Finally, we show that the proposed algorithm could also be
leveraged to mitigate memorization in the model through finetuning. We have
released the code to reproduce our results at
https://github.com/vyraun/Finding-Memo.
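The abstract defines extractive memorization as exact training data generation under insufficient context. As a rough illustration only, and not the authors' exact procedure, the sketch below checks whether a proper prefix of a training source sentence already elicits the full training target; the `translate` callable, the `bitext` iterable, and the 0.75 prefix cap are hypothetical stand-ins.

```python
# Hedged sketch: flagging extractive memorization (exact training-target
# generation under insufficient source context) in an NMT model.
# `translate` is a hypothetical stand-in for the model's inference call;
# the prefix ratio and exact-match criterion are illustrative choices,
# not necessarily those used in the paper.

from typing import Callable, Iterable, List, Tuple

def find_extractive_memorization(
    translate: Callable[[str], str],
    bitext: Iterable[Tuple[str, str]],
    max_prefix_ratio: float = 0.75,
) -> List[Tuple[str, str, str]]:
    """Return (source, target, triggering_prefix) triples where a proper
    prefix of the source already elicits the exact training target."""
    flagged = []
    for source, target in bitext:
        tokens = source.split()
        n = len(tokens)
        # Only consider prefixes clearly shorter than the full source,
        # i.e. "insufficient context" for producing the full target.
        max_len = min(n - 1, max(1, int(n * max_prefix_ratio)))
        if max_len < 1:
            continue  # single-token sources have no proper prefix
        for k in range(1, max_len + 1):
            prefix = " ".join(tokens[:k])
            if translate(prefix).strip() == target.strip():
                flagged.append((source, target, prefix))
                break
    return flagged
```

For the actual detection and recovery algorithms, refer to the released code linked above.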
Related papers
- The Unreasonable Ineffectiveness of Nucleus Sampling on Mitigating Text Memorization [15.348047288817478]
We analyze the text memorization behavior of large language models (LLMs) when subjected to nucleus sampling.
An increase of the nucleus size reduces memorization only modestly.
Even when models do not engage in "hard" memorization, they may still display "soft" memorization.
arXiv Detail & Related papers (2024-08-29T08:30:33Z)
- Demystifying Verbatim Memorization in Large Language Models [67.49068128909349]
Large Language Models (LLMs) frequently memorize long sequences verbatim, often with serious legal and privacy implications.
We develop a framework to study verbatim memorization in a controlled setting by continuing pre-training from Pythia checkpoints with injected sequences.
We find that (1) non-trivial amounts of repetition are necessary for verbatim memorization to happen; (2) later (and presumably better) checkpoints are more likely to memorize verbatim sequences, even for out-of-distribution sequences.
arXiv Detail & Related papers (2024-07-25T07:10:31Z)
- Reducing Training Sample Memorization in GANs by Training with Memorization Rejection [80.0916819303573]
We propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training (a minimal sketch appears after this list).
Our scheme is simple, generic and can be directly applied to any GAN architecture.
arXiv Detail & Related papers (2022-10-21T20:17:50Z)
- Measures of Information Reflect Memorization Patterns [53.71420125627608]
We show that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization.
Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabelled in-distribution examples.
arXiv Detail & Related papers (2022-10-17T20:15:24Z)
- Quantifying Memorization Across Neural Language Models [61.58529162310382]
Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized data verbatim.
This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others).
We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data.
arXiv Detail & Related papers (2022-02-15T18:48:31Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
- Encoding-based Memory Modules for Recurrent Neural Networks [79.42778415729475]
We study the memorization subtask from the point of view of the design and training of recurrent neural networks.
We propose a new model, the Linear Memory Network, which features an encoding-based memorization component built with a linear autoencoder for sequences.
arXiv Detail & Related papers (2020-01-31T11:14:27Z)
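As referenced in the memorization-rejection entry above, the following is a minimal sketch of the idea as summarized there: filter out generated samples whose nearest training sample is closer than a threshold. The distance measure, the threshold, and the use of flat NumPy arrays are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of "memorization rejection": during GAN training, drop
# generated samples that are near-duplicates of training samples.
# Euclidean distance in flattened sample space and the threshold are
# illustrative choices, not the paper's exact ones.

import numpy as np

def reject_near_duplicates(
    generated: np.ndarray,   # shape (B, D): a batch of generated samples
    train_data: np.ndarray,  # shape (N, D): (a subset of) the training set
    threshold: float,
) -> np.ndarray:
    """Keep only generated samples whose nearest training neighbour is
    farther than `threshold`; the rest are rejected as near-duplicates."""
    # Pairwise Euclidean distances between generated and training samples.
    diffs = generated[:, None, :] - train_data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))   # shape (B, N)
    nearest = dists.min(axis=1)                  # distance to closest training sample
    return generated[nearest > threshold]

# In a training loop, the filtered batch (rather than the raw generator
# output) would be used when computing the GAN updates.
```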
This list is automatically generated from the titles and abstracts of the papers on this site.