A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It
- URL: http://arxiv.org/abs/2104.07815v1
- Date: Thu, 15 Apr 2021 23:15:12 GMT
- Title: A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It
- Authors: Trung Dang, Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, Peter Chin, Françoise Beaufays
- Abstract summary: We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
- Score: 3.18475216176047
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end Automatic Speech Recognition (ASR) models are commonly trained
over spoken utterances using optimization methods like Stochastic Gradient
Descent (SGD). In distributed settings like Federated Learning, model training
requires transmission of gradients over a network. In this work, we design the
first method for revealing the identity of the speaker of a training utterance
with access only to a gradient. We propose Hessian-Free Gradients Matching, an
input reconstruction technique that operates without second derivatives of the
loss function (required in prior works), which can be expensive to compute. We
show the effectiveness of our method using the DeepSpeech model architecture,
demonstrating that it is possible to reveal the speaker's identity with 34%
top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset. Further, we
study the effect of two well-known techniques, Differentially Private SGD and
Dropout, on the success of our method. We show that a dropout rate of 0.2 can
reduce the speaker identity accuracy to 0% top-1 (0.5% top-5).
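To make the attack surface concrete, below is a minimal, self-contained sketch of gradient-matching input reconstruction. It uses a toy linear classifier in place of DeepSpeech, assumes the attacker knows the label, and optimizes the match with a generic SPSA-style zeroth-order update. Like the paper's method, it needs no second derivatives of the loss, but it is not necessarily the authors' exact Hessian-Free Gradients Matching procedure.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for DeepSpeech: a single linear layer over a 16-dim feature
# vector. The attacker only needs the gradient produced by one utterance.
model = torch.nn.Linear(16, 8)
loss_fn = torch.nn.CrossEntropyLoss()

def training_gradient(x, y):
    """Flattened parameter gradient produced by one training example (x, y)."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten() for p in model.parameters()])

def matching_loss(candidate, y, observed):
    """Cosine distance between the candidate's gradient and the observed one."""
    g = training_gradient(candidate, y)
    return 1.0 - torch.nn.functional.cosine_similarity(g, observed, dim=0)

# The victim's utterance features; the attacker sees only the gradient and,
# for simplicity in this sketch, is assumed to know the label.
x_true, y_true = torch.randn(1, 16), torch.tensor([3])
observed = training_gradient(x_true, y_true)

# Hessian-free optimization via SPSA-style two-point estimates: the update
# never differentiates through the gradient, so no second derivatives of
# the loss function are required.
x_hat, lr, eps = torch.randn(1, 16), 0.1, 1e-3
for _ in range(2000):
    delta = torch.sign(torch.randn_like(x_hat))        # random +/-1 direction
    l_plus = matching_loss(x_hat + eps * delta, y_true, observed)
    l_minus = matching_loss(x_hat - eps * delta, y_true, observed)
    x_hat = x_hat - lr * (l_plus - l_minus) / (2 * eps) * delta

print("reconstruction error:", (x_hat - x_true).norm().item())
```

The abstract also studies DP-SGD as a countermeasure. A minimal sketch of its per-example clipping and noising step (hyperparameters illustrative) shows why the released gradient stops being a clean function of any single utterance:

```python
import torch

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """Per-example clipping plus Gaussian noise, the core DP-SGD aggregation.

    per_example_grads: list of flattened (P,) gradients, one per example.
    """
    clipped = [g * min(1.0, clip_norm / (g.norm().item() + 1e-12))
               for g in per_example_grads]
    summed = torch.stack(clipped).sum(dim=0)
    noise = torch.randn_like(summed) * noise_multiplier * clip_norm
    # The released gradient is a noisy average: no single utterance can
    # dominate it, which is what blunts reconstruction attacks.
    return (summed + noise) / len(per_example_grads)

grads = [torch.randn(100) for _ in range(8)]
print(dp_sgd_step(grads).norm().item())
```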
Related papers
- Bag of Tricks for Effective Language Model Pretraining and Downstream Adaptation: A Case Study on GLUE [93.98660272309974]
This report briefly describes our submission Vega v1 to the General Language Understanding Evaluation (GLUE) leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
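As a rough illustration of the PVI filter described above, here is a sketch that assumes precomputed log-probabilities; the field names and threshold are ours, while the formula follows the standard PVI definition (log-odds of the gold label given the input versus given a null input):

```python
import math

def pvi(log_p_cond, log_p_null):
    """PVI(x -> y) = log2 p(y | x) - log2 p(y | null).

    log_p_cond: log-prob of the gold intent given the utterance, from a
                classifier fine-tuned on (utterance, intent) seed data.
    log_p_null: log-prob of the gold intent given an empty input, from a
                classifier fine-tuned on (null, intent) pairs.
    """
    return (log_p_cond - log_p_null) / math.log(2.0)

def select_useful(candidates, threshold=0.5):
    """Keep synthesized datapoints whose PVI exceeds the threshold."""
    return [c for c in candidates
            if pvi(c["log_p_cond"], c["log_p_null"]) > threshold]

# Toy usage: two synthesized utterances with precomputed log-probs; the
# second carries almost no information about its intent and is dropped.
cands = [
    {"text": "book a table for two", "log_p_cond": -0.1, "log_p_null": -2.3},
    {"text": "uh hmm okay",          "log_p_cond": -2.0, "log_p_null": -1.9},
]
print([c["text"] for c in select_useful(cands)])
```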
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
- Guided contrastive self-supervised pre-training for automatic speech recognition [16.038298927903632]
Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model.
We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC).
Our proposed method maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge injection during pre-training.
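A minimal sketch of the kind of objective this describes: an InfoNCE-style loss that ties each frame of the encoder being pre-trained to the frozen prior-knowledge model's frame at the same time step, with other frames acting as negatives. The exact GCPC formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(pred, prior, temperature=0.1):
    """Contrastive loss between two frame-aligned representation sequences.

    pred:  (T, D) outputs of the encoder being pre-trained
    prior: (T, D) frozen prior-knowledge model representations
    """
    pred = F.normalize(pred, dim=-1)
    prior = F.normalize(prior, dim=-1)
    logits = pred @ prior.t() / temperature   # (T, T) similarity matrix
    targets = torch.arange(pred.size(0))      # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for real model outputs.
T, D = 50, 256
print(info_nce(torch.randn(T, D), torch.randn(T, D)).item())
```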
arXiv Detail & Related papers (2022-10-22T02:38:43Z)
- Extracting Targeted Training Data from ASR Models, and How to Mitigate It [14.82033976002072]
Noise Masking is a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models.
We show that we are able to extract the correct names from masked training utterances with 11.8% accuracy.
We also show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate).
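A hedged sketch of the noise-masking loop this summary describes; `transcribe`, the masked span, and the toy audio are hypothetical stand-ins for the paper's real pipeline over training utterances:

```python
import numpy as np

def mask_with_noise(audio, start, end, scale=0.1):
    """Replace audio[start:end] with Gaussian noise of similar scale."""
    masked = audio.copy()
    masked[start:end] = np.random.randn(end - start) * scale
    return masked

def extraction_attempt(audio, name_span, true_name, transcribe):
    """Mask the name region, run ASR, and check whether the model fills in
    the memorized name (a sign of training-data leakage)."""
    masked = mask_with_noise(audio, *name_span)
    hypothesis = transcribe(masked)
    return true_name.lower() in hypothesis.lower()

# Toy usage: a dummy transcriber that always outputs the same string.
audio = np.random.randn(16000)  # 1 second at 16 kHz
hit = extraction_attempt(audio, (4000, 8000), "Alice",
                         transcribe=lambda a: "call alice now")
print("leaked name:", hit)
```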
arXiv Detail & Related papers (2022-04-18T14:43:17Z)
- RescoreBERT: Discriminative Speech Recognition Rescoring with BERT [21.763672436079872]
We show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR.
We name this approach RescoreBERT and evaluate it on the LibriSpeech corpus, where it reduces WER by 6.6%/3.4% relative on the clean/other test sets over a BERT baseline without a discriminative objective.
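For intuition, here is a minimal sketch of an MWER-style loss over an n-best list; the score combination and variance-reduction details are assumptions of ours, not necessarily RescoreBERT's exact recipe:

```python
import torch

def mwer_loss(scores, word_errors):
    """Minimum word error rate loss over an n-best list.

    scores:      (N,) combined first-pass + rescorer scores (higher = better)
    word_errors: (N,) word errors of each hypothesis against the reference
    """
    probs = torch.softmax(scores, dim=0)       # hypothesis posteriors
    mean_err = (probs * word_errors).sum()     # expected number of word errors
    # Subtracting the (detached) mean is a standard variance-reduction
    # baseline; it does not change the expected gradient direction.
    return (probs * (word_errors - mean_err.detach())).sum()

# Toy usage: training pushes probability mass toward low-error hypotheses.
scores = torch.tensor([2.0, 1.5, 0.3, -1.0], requires_grad=True)
errors = torch.tensor([1.0, 0.0, 3.0, 2.0])
loss = mwer_loss(scores, errors)
loss.backward()
print(loss.item(), scores.grad)
```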
arXiv Detail & Related papers (2022-02-02T15:45:26Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
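A simplified sketch of the target-switching idea: contextual features from the noisy view are trained to predict the clean view's quantized targets and vice versa, in addition to the standard task. The contrastive loss here is a simplification (negatives are other frames in the same utterance, whereas wav2vec 2.0 samples them differently):

```python
import torch
import torch.nn.functional as F

def switched_contrastive(ctx_clean, ctx_noisy, q_clean, q_noisy, temp=0.1):
    """ctx_*: (T, D) context network outputs; q_*: (T, D) quantized targets."""
    def nce(ctx, targets):
        ctx = F.normalize(ctx, dim=-1)
        targets = F.normalize(targets, dim=-1)
        logits = ctx @ targets.t() / temp   # other frames act as negatives
        return F.cross_entropy(logits, torch.arange(ctx.size(0)))
    # Standard task: predict own targets; switched task: predict the pair's,
    # forcing representations to agree across clean and noisy views.
    standard = nce(ctx_clean, q_clean) + nce(ctx_noisy, q_noisy)
    switched = nce(ctx_clean, q_noisy) + nce(ctx_noisy, q_clean)
    return standard + switched

T, D = 40, 128
loss = switched_contrastive(torch.randn(T, D), torch.randn(T, D),
                            torch.randn(T, D), torch.randn(T, D))
print(loss.item())
```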
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Diverse Knowledge Distillation for End-to-End Person Search [81.4926655119318]
Person search aims to localize and identify a specific person from a gallery of images.
Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches.
We propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck.
arXiv Detail & Related papers (2020-12-21T09:04:27Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
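A minimal sketch of gated feature fusion in this spirit; the paper's GRF uses recurrent gating, which this simplified per-frame version omits:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learn a per-frame gate that mixes noisy and enhanced features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy, enhanced):
        """noisy, enhanced: (T, dim) feature sequences of the same utterance."""
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        # g ~ 1 keeps the raw noisy feature, g ~ 0 trusts the enhanced one.
        return g * noisy + (1.0 - g) * enhanced

fusion = GatedFusion(dim=80)  # e.g. 80-dim log-mel features
out = fusion(torch.randn(100, 80), torch.randn(100, 80))
print(out.shape)
```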
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- Incremental Learning for End-to-End Automatic Speech Recognition [41.297106772785206]
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR).
We design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions.
Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting.
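A hedged sketch of combining response-based distillation (match the old model's output distribution) with an explainability term; input-gradient saliency is our stand-in for the paper's attribution method, and the toy linear models stand in for real ASR networks:

```python
import torch
import torch.nn.functional as F

def saliency(model, x, target, create_graph=False):
    """Input-gradient attribution for one target class logit."""
    x = x.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(model(x)[0, target], x,
                                  create_graph=create_graph)
    return grad

def incremental_kd_loss(new_model, old_model, x, y, alpha=1.0, beta=1.0):
    ce = F.cross_entropy(new_model(x), y)       # loss on the new task
    with torch.no_grad():
        old_logits = old_model(x)
    # Response-based distillation: keep the old model's predictions.
    kd = F.kl_div(F.log_softmax(new_model(x), dim=-1),
                  F.softmax(old_logits, dim=-1), reduction="batchmean")
    # Explainability term: keep the old model's "reason" (attribution) too.
    target = old_logits.argmax(dim=-1)[0]
    expl = F.mse_loss(saliency(new_model, x, target, create_graph=True),
                      saliency(old_model, x, target).detach())
    return ce + alpha * kd + beta * expl

old_model, new_model = torch.nn.Linear(20, 5), torch.nn.Linear(20, 5)
x, y = torch.randn(1, 20), torch.tensor([2])
print(incremental_kd_loss(new_model, old_model, x, y).item())
```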
arXiv Detail & Related papers (2020-05-11T08:18:08Z)