Unintended Memorization in Large ASR Models, and How to Mitigate It
- URL: http://arxiv.org/abs/2310.11739v1
- Date: Wed, 18 Oct 2023 06:45:49 GMT
- Title: Unintended Memorization in Large ASR Models, and How to Mitigate It
- Authors: Lun Wang, Om Thakkar, Rajiv Mathews
- Abstract summary: Auditing memorization in large non-auto-regressive automatic speech recognition (ASR) models has been challenging due to the high compute cost of existing methods.
We design a simple auditing method to measure memorization in large ASR models without the extra compute overhead.
We show that in large-scale distributed training, clipping the average gradient on each compute core maintains neutral model quality and compute cost.
- Score: 16.047859326721046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well-known that neural networks can unintentionally memorize their
training examples, causing privacy concerns. However, auditing memorization in
large non-auto-regressive automatic speech recognition (ASR) models has been
challenging due to the high compute cost of existing methods such as hardness
calibration. In this work, we design a simple auditing method to measure
memorization in large ASR models without the extra compute overhead.
Concretely, we speed up randomly-generated utterances to create a mapping
between vocal and text information that is difficult to learn from typical
training examples. Hence, accurate predictions only for sped-up training
examples can serve as clear evidence for memorization, and the corresponding
accuracy can be used to measure memorization. Using the proposed method, we
showcase memorization in the state-of-the-art ASR models. To mitigate
memorization, we tried gradient clipping during training to bound the influence
of any individual example on the final model. We empirically show that clipping
each example's gradient can mitigate memorization for sped-up training examples
with up to 16 repetitions in the training set. Furthermore, we show that in
large-scale distributed training, clipping the average gradient on each compute
core maintains neutral model quality and compute cost while providing strong
privacy protection.
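To make the auditing idea concrete, here is a minimal sketch of how a speed-up canary audit could be wired up. The `asr_model.transcribe(...)` interface, the `wer(...)` helper, and the canary bookkeeping are placeholders chosen for illustration, not the paper's implementation:

```python
import numpy as np

def speed_up(waveform: np.ndarray, factor: float = 2.0) -> np.ndarray:
    """Naively resample the waveform so it plays back `factor` times faster.

    Sped-up speech pairs vocal information with text in a way that is hard to
    learn from typical utterances, so correct transcriptions of sped-up
    canaries seen in training are treated as evidence of memorization.
    """
    positions = np.arange(0.0, len(waveform), factor)
    return np.interp(positions, np.arange(len(waveform)), waveform)

def audit_memorization(asr_model, canaries, wer, factor: float = 2.0) -> float:
    """Average word error rate on sped-up canary utterances.

    `canaries` is a list of (waveform, sample_rate, transcript) tuples of
    randomly generated utterances injected into training; a low WER here,
    relative to held-out sped-up utterances, indicates memorization.
    """
    errors = []
    for waveform, sample_rate, transcript in canaries:
        hypothesis = asr_model.transcribe(speed_up(waveform, factor), sample_rate)
        errors.append(wer(transcript, hypothesis))
    return float(np.mean(errors))
```

The mitigation side of the abstract, per-example versus per-core gradient clipping, can be sketched in the same hedged spirit. How the per-example or per-core gradients are obtained is framework-specific and assumed here; the sketch only shows the clipping and averaging step:

```python
import numpy as np

def clip_to_norm(grad: np.ndarray, clip_norm: float) -> np.ndarray:
    """Scale `grad` down so its L2 norm is at most `clip_norm`."""
    norm = np.linalg.norm(grad)
    return grad * min(1.0, clip_norm / (norm + 1e-12))

def per_example_clipped_update(per_example_grads, clip_norm):
    """Per-example clipping: bound each example's influence, then average."""
    clipped = [clip_to_norm(g, clip_norm) for g in per_example_grads]
    return np.mean(clipped, axis=0)

def per_core_clipped_update(per_core_avg_grads, clip_norm):
    """Per-core clipping: clip the *average* gradient computed on each compute
    core before combining across cores; cheaper than per-example clipping
    because no per-example gradients need to be materialized."""
    clipped = [clip_to_norm(g, clip_norm) for g in per_core_avg_grads]
    return np.mean(clipped, axis=0)
```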
Related papers
- Predicting and analyzing memorization within fine-tuned Large Language Models [0.0]
Large language models memorize a significant proportion of their training data, posing a serious privacy threat when that data is disclosed at inference time.
We propose a new approach based on sliced mutual information to detect memorized samples a priori.
We obtain strong empirical results, paving the way for systematic inspection and protection of these vulnerable samples before memorization happens.
arXiv Detail & Related papers (2024-09-27T15:53:55Z)
- Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
arXiv Detail & Related papers (2024-07-31T16:13:29Z)
- Mitigating Approximate Memorization in Language Models via Dissimilarity Learned Policy [0.0]
Large language models (LLMs) are trained on large amounts of data.
LLMs have been shown to memorize parts of their training data and to emit it verbatim when prompted appropriately by an adversary.
arXiv Detail & Related papers (2023-05-02T15:53:28Z)
- Reducing Training Sample Memorization in GANs by Training with Memorization Rejection [80.0916819303573]
We propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training; a minimal illustrative sketch appears after this list.
Our scheme is simple, generic, and can be directly applied to any GAN architecture.
arXiv Detail & Related papers (2022-10-21T20:17:50Z)
- Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning [113.58691755215663]
We develop RetroPrompt to help a model strike a balance between generalization and memorization.
In contrast with vanilla prompt learning, RetroPrompt constructs an open-book knowledge-store from training instances.
Extensive experiments demonstrate that RetroPrompt can obtain better performance in both few-shot and zero-shot settings.
arXiv Detail & Related papers (2022-05-29T16:07:30Z)
- Counterfactual Memorization in Neural Language Models [91.8747020391287]
Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data.
An open question in previous studies of language model memorization is how to filter out "common" memorization.
We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training.
arXiv Detail & Related papers (2021-12-24T04:20:57Z)
- Exploring Memorization in Adversarial Training [58.38336773082818]
We investigate the memorization effect in adversarial training (AT) for promoting a deeper understanding of capacity, convergence, generalization, and especially robust overfitting.
We propose a new mitigation algorithm motivated by detailed memorization analyses.
arXiv Detail & Related papers (2021-06-03T05:39:57Z)
- Automatic Recall Machines: Internal Replay, Continual Learning and the Brain [104.38824285741248]
Replay in neural networks involves training on sequential data with memorized samples, which counteracts forgetting of previous behavior caused by non-stationarity.
We present a method where these auxiliary samples are generated on the fly, given only the model that is being trained for the assessed objective.
Instead, the implicit memory of learned samples within the assessed model itself is exploited.
arXiv Detail & Related papers (2020-06-22T15:07:06Z)
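As referenced in the memorization-rejection entry above, here is a minimal illustrative sketch of a near-duplicate filter in that spirit. The L2 distance metric, the threshold, and the flattened-array layout are assumptions for illustration, not that paper's implementation:

```python
import numpy as np

def reject_near_duplicates(generated: np.ndarray,
                           training_set: np.ndarray,
                           threshold: float) -> np.ndarray:
    """Keep only generated samples whose nearest training sample is farther
    than `threshold` in L2 distance (rows are flattened samples).

    Hypothetical filter: samples that would be rejected are dropped before
    they contribute to the generator/discriminator update.
    """
    keep = []
    for sample in generated:
        distances = np.linalg.norm(training_set - sample, axis=1)
        if distances.min() > threshold:
            keep.append(sample)
    return np.stack(keep) if keep else generated[:0]
```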