Extracting Targeted Training Data from ASR Models, and How to Mitigate It
- URL: http://arxiv.org/abs/2204.08345v1
- Date: Mon, 18 Apr 2022 14:43:17 GMT
- Title: Extracting Targeted Training Data from ASR Models, and How to Mitigate It
- Authors: Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays
- Abstract summary: Noise Masking is a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models.
We show that we are able to extract the correct names from masked training utterances with 11.8% accuracy.
We also show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate).
- Score: 14.82033976002072
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent work has designed methods to demonstrate that model updates in ASR
training can leak potentially sensitive attributes of the utterances used in
computing the updates. In this work, we design the first method to demonstrate
information leakage about training data from trained ASR models. We design
Noise Masking, a fill-in-the-blank style method for extracting targeted parts
of training data from trained ASR models. We demonstrate the success of Noise
Masking by using it in four settings for extracting names from the LibriSpeech
dataset used for training a SOTA Conformer model. In particular, we show that
we are able to extract the correct names from masked training utterances with
11.8% accuracy, while the model outputs some name from the train set 55.2% of
the time. Further, we show that even in a setting that uses synthetic audio and
partial transcripts from the test set, our method achieves 2.5% correct name
accuracy (47.7% any name success rate). Lastly, we design Word Dropout, a data
augmentation method that we show when used in training along with MTR, provides
comparable utility as the baseline, along with significantly mitigating
extraction via Noise Masking across the four evaluated settings.
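The two techniques in the abstract lend themselves to a short sketch. Below is a minimal, hypothetical Python illustration: the paper operates on real audio and a Conformer model, whereas here a waveform is just a list of float samples, and all function names and parameters are our own, not the authors':

```python
import random

def noise_mask(waveform, start, end, amplitude=0.01, seed=0):
    """Replace the samples in waveform[start:end] with white noise.

    Sketch of the Noise Masking idea: the masked region would cover a
    name in a training utterance, and the masked audio is then fed to
    the trained ASR model to see what it "fills in".
    """
    rng = random.Random(seed)
    masked = list(waveform)
    for i in range(start, end):
        masked[i] = rng.gauss(0.0, amplitude)  # white noise sample
    return masked

def word_dropout(transcript, p=0.1, seed=0):
    """Delete each word of a training transcript with probability p.

    Sketch of the Word Dropout mitigation, applied to reference
    transcripts during ASR training (alongside MTR in the paper).
    """
    rng = random.Random(seed)
    return " ".join(w for w in transcript.split() if rng.random() >= p)
```

In an attack setting, the output of `noise_mask` would be transcribed by the trained model and the transcript checked for any memorized name; `word_dropout` would instead be applied to training transcripts so the model learns not to reproduce masked-out words verbatim.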
Related papers
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.
We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.
Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
- CovarNav: Machine Unlearning via Model Inversion and Covariance Navigation [11.222501077070765]
Machine unlearning has emerged as an essential technique to selectively remove the influence of specific training data points on trained models.
We introduce a three-step process, named CovarNav, to facilitate this forgetting.
We rigorously evaluate CovarNav on the CIFAR-10 and Vggface2 datasets.
arXiv Detail & Related papers (2023-11-21T21:19:59Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- Ethicist: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation [56.57532238195446]
We propose a method named Ethicist for targeted training data extraction.
To elicit memorization, we tune soft prompt embeddings while keeping the model fixed.
We show that Ethicist significantly improves the extraction performance on a recently proposed public benchmark.
arXiv Detail & Related papers (2023-07-10T08:03:41Z)
- Boosting Facial Expression Recognition by A Semi-Supervised Progressive Teacher [54.50747989860957]
We propose a semi-supervised learning algorithm named Progressive Teacher (PT) to utilize reliable FER datasets as well as large-scale unlabeled expression images for effective training.
Experiments on widely-used databases RAF-DB and FERPlus validate the effectiveness of our method, which achieves state-of-the-art performance with accuracy of 89.57% on RAF-DB.
arXiv Detail & Related papers (2022-05-28T07:47:53Z)
- Noisy Training Improves E2E ASR for the Edge [22.91184103295888]
Automatic speech recognition (ASR) has become increasingly ubiquitous on modern edge devices.
E2E ASR models are prone to overfitting and have difficulties in generalizing to unseen testing data.
We present a simple yet effective noisy training strategy to further improve the E2E ASR model training.
arXiv Detail & Related papers (2021-07-09T20:56:20Z)
- A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It [3.18475216176047]
We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
arXiv Detail & Related papers (2021-04-15T23:15:12Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To keep training on the created dataset tractable, we propose to apply a dataset distillation strategy to compress it into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Incremental Learning for End-to-End Automatic Speech Recognition [41.297106772785206]
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR).
We design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions.
Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting.
arXiv Detail & Related papers (2020-05-11T08:18:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.