A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It
- URL: http://arxiv.org/abs/2104.07815v1
- Date: Thu, 15 Apr 2021 23:15:12 GMT
- Title: A Method to Reveal Speaker Identity in Distributed ASR Training, and How to Counter It
- Authors: Trung Dang, Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, Peter Chin, Françoise Beaufays
- Abstract summary: We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
- Score: 3.18475216176047
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: End-to-end Automatic Speech Recognition (ASR) models are commonly trained
over spoken utterances using optimization methods like Stochastic Gradient
Descent (SGD). In distributed settings like Federated Learning, model training
requires transmission of gradients over a network. In this work, we design the
first method for revealing the identity of the speaker of a training utterance
with access only to a gradient. We propose Hessian-Free Gradients Matching, an
input reconstruction technique that operates without second derivatives of the
loss function (required in prior works), which can be expensive to compute. We
show the effectiveness of our method using the DeepSpeech model architecture,
demonstrating that it is possible to reveal the speaker's identity with 34%
top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset. Further, we
study the effect of two well-known techniques, Differentially Private SGD and
Dropout, on the success of our method. We show that a dropout rate of 0.2 can
reduce the speaker identity accuracy to 0% top-1 (0.5% top-5).
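To make the attack surface concrete, below is a minimal, self-contained sketch of gradient-matching input reconstruction. It uses a toy linear classifier in place of DeepSpeech, assumes the attacker knows the label, and optimizes the match with a generic SPSA-style zeroth-order update. Like the paper's method, it needs no second derivatives of the loss, but it is not necessarily the authors' exact Hessian-Free Gradients Matching procedure.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for DeepSpeech: a single linear layer over a 16-dim feature
# vector. The attacker only needs the gradient produced by one utterance.
model = torch.nn.Linear(16, 8)
loss_fn = torch.nn.CrossEntropyLoss()

def training_gradient(x, y):
    """Flattened parameter gradient produced by one training example (x, y)."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten() for p in model.parameters()])

def matching_loss(candidate, y, observed):
    """Cosine distance between the candidate's gradient and the observed one."""
    g = training_gradient(candidate, y)
    return 1.0 - torch.nn.functional.cosine_similarity(g, observed, dim=0)

# The victim's utterance features; the attacker sees only the gradient and,
# for simplicity in this sketch, is assumed to know the label.
x_true, y_true = torch.randn(1, 16), torch.tensor([3])
observed = training_gradient(x_true, y_true)

# Hessian-free optimization via SPSA-style two-point estimates: the update
# never differentiates through the gradient, so no second derivatives of
# the loss function are required.
x_hat, lr, eps = torch.randn(1, 16), 0.1, 1e-3
for _ in range(2000):
    delta = torch.sign(torch.randn_like(x_hat))        # random +/-1 direction
    l_plus = matching_loss(x_hat + eps * delta, y_true, observed)
    l_minus = matching_loss(x_hat - eps * delta, y_true, observed)
    x_hat = x_hat - lr * (l_plus - l_minus) / (2 * eps) * delta

print("reconstruction error:", (x_hat - x_true).norm().item())
```

The abstract also studies DP-SGD as a countermeasure. A minimal sketch of its per-example clipping and noising step (hyperparameters illustrative) shows why the released gradient stops being a clean function of any single utterance:

```python
import torch

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0):
    """Per-example clipping plus Gaussian noise, the core DP-SGD aggregation.

    per_example_grads: list of flattened (P,) gradients, one per example.
    """
    clipped = [g * min(1.0, clip_norm / (g.norm().item() + 1e-12))
               for g in per_example_grads]
    summed = torch.stack(clipped).sum(dim=0)
    noise = torch.randn_like(summed) * noise_multiplier * clip_norm
    # The released gradient is a noisy average: no single utterance can
    # dominate it, which is what blunts reconstruction attacks.
    return (summed + noise) / len(per_example_grads)

grads = [torch.randn(100) for _ in range(8)]
print(dp_sgd_step(grads).norm().item())
```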
Related papers
- Bag of Tricks for Effective Language Model Pretraining and Downstream Adaptation: A Case Study on GLUE [93.98660272309974]
This report briefly describes our submission Vega v1 to the General Language Understanding Evaluation (GLUE) leaderboard.
GLUE is a collection of nine natural language understanding tasks, including question answering, linguistic acceptability, sentiment analysis, text similarity, paraphrase detection, and natural language inference.
With our optimized pretraining and fine-tuning strategies, our 1.3-billion-parameter model sets a new state of the art on 4 of the 9 tasks, achieving the best average score of 91.3.
arXiv Detail & Related papers (2023-02-18T09:26:35Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
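As a rough illustration of the PVI filter described above, here is a sketch that assumes precomputed log-probabilities; the field names and threshold are ours, while the formula follows the standard PVI definition (log-odds of the gold label given the input versus given a null input):

```python
import math

def pvi(log_p_cond, log_p_null):
    """PVI(x -> y) = log2 p(y | x) - log2 p(y | null).

    log_p_cond: log-prob of the gold intent given the utterance, from a
                classifier fine-tuned on (utterance, intent) seed data.
    log_p_null: log-prob of the gold intent given an empty input, from a
                classifier fine-tuned on (null, intent) pairs.
    """
    return (log_p_cond - log_p_null) / math.log(2.0)

def select_useful(candidates, threshold=0.5):
    """Keep synthesized datapoints whose PVI exceeds the threshold."""
    return [c for c in candidates
            if pvi(c["log_p_cond"], c["log_p_null"]) > threshold]

# Toy usage: two synthesized utterances with precomputed log-probs; the
# second carries almost no information about its intent and is dropped.
cands = [
    {"text": "book a table for two", "log_p_cond": -0.1, "log_p_null": -2.3},
    {"text": "uh hmm okay",          "log_p_cond": -2.0, "log_p_null": -1.9},
]
print([c["text"] for c in select_useful(cands)])
```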
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
- Guided contrastive self-supervised pre-training for automatic speech recognition [16.038298927903632]
Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model.
We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC).
Our proposed method maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge injection during pre-training.
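A minimal sketch of the kind of objective this describes: an InfoNCE-style loss that ties each frame of the encoder being pre-trained to the frozen prior-knowledge model's frame at the same time step, with other frames acting as negatives. The exact GCPC formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(pred, prior, temperature=0.1):
    """Contrastive loss between two frame-aligned representation sequences.

    pred:  (T, D) outputs of the encoder being pre-trained
    prior: (T, D) frozen prior-knowledge model representations
    """
    pred = F.normalize(pred, dim=-1)
    prior = F.normalize(prior, dim=-1)
    logits = pred @ prior.t() / temperature   # (T, T) similarity matrix
    targets = torch.arange(pred.size(0))      # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random features standing in for real model outputs.
T, D = 50, 256
print(info_nce(torch.randn(T, D), torch.randn(T, D)).item())
```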
arXiv Detail & Related papers (2022-10-22T02:38:43Z)
- Extracting Targeted Training Data from ASR Models, and How to Mitigate It [14.82033976002072]
Noise Masking is a fill-in-the-blank style method for extracting targeted parts of training data from trained ASR models.
We show that we are able to extract the correct names from masked training utterances with 11.8% accuracy.
We also show that even in a setting that uses synthetic audio and partial transcripts from the test set, our method achieves 2.5% correct name accuracy (47.7% any name success rate).
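A hedged sketch of the noise-masking loop this summary describes; `transcribe`, the masked span, and the toy audio are hypothetical stand-ins for the paper's real pipeline over training utterances:

```python
import numpy as np

def mask_with_noise(audio, start, end, scale=0.1):
    """Replace audio[start:end] with Gaussian noise of similar scale."""
    masked = audio.copy()
    masked[start:end] = np.random.randn(end - start) * scale
    return masked

def extraction_attempt(audio, name_span, true_name, transcribe):
    """Mask the name region, run ASR, and check whether the model fills in
    the memorized name (a sign of training-data leakage)."""
    masked = mask_with_noise(audio, *name_span)
    hypothesis = transcribe(masked)
    return true_name.lower() in hypothesis.lower()

# Toy usage: a dummy transcriber that always outputs the same string.
audio = np.random.randn(16000)  # 1 second at 16 kHz
hit = extraction_attempt(audio, (4000, 8000), "Alice",
                         transcribe=lambda a: "call alice now")
print("leaked name:", hit)
```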
arXiv Detail & Related papers (2022-04-18T14:43:17Z)
- RescoreBERT: Discriminative Speech Recognition Rescoring with BERT [21.763672436079872]
We show how to train a BERT-based rescoring model with MWER loss, to incorporate the improvements of a discriminative loss into fine-tuning of deep bidirectional pretrained models for ASR.
We name this approach RescoreBERT and evaluate it on the LibriSpeech corpus, where it reduces WER by 6.6%/3.4% relative on the clean/other test sets over a BERT baseline without a discriminative objective.
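For intuition, here is a minimal sketch of an MWER-style loss over an n-best list; the score combination and variance-reduction details are assumptions of ours, not necessarily RescoreBERT's exact recipe:

```python
import torch

def mwer_loss(scores, word_errors):
    """Minimum word error rate loss over an n-best list.

    scores:      (N,) combined first-pass + rescorer scores (higher = better)
    word_errors: (N,) word errors of each hypothesis against the reference
    """
    probs = torch.softmax(scores, dim=0)       # hypothesis posteriors
    mean_err = (probs * word_errors).sum()     # expected number of word errors
    # Subtracting the (detached) mean is a standard variance-reduction
    # baseline; it does not change the expected gradient direction.
    return (probs * (word_errors - mean_err.detach())).sum()

# Toy usage: training pushes probability mass toward low-error hypotheses.
scores = torch.tensor([2.0, 1.5, 0.3, -1.0], requires_grad=True)
errors = torch.tensor([1.0, 0.0, 3.0, 2.0])
loss = mwer_loss(scores, errors)
loss.backward()
print(loss.item(), scores.grad)
```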
arXiv Detail & Related papers (2022-02-02T15:45:26Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experimental results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
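A simplified sketch of the target-switching idea: contextual features from the noisy view are trained to predict the clean view's quantized targets and vice versa, in addition to the standard task. The contrastive loss here is a simplification (negatives are other frames in the same utterance, whereas wav2vec 2.0 samples them differently):

```python
import torch
import torch.nn.functional as F

def switched_contrastive(ctx_clean, ctx_noisy, q_clean, q_noisy, temp=0.1):
    """ctx_*: (T, D) context network outputs; q_*: (T, D) quantized targets."""
    def nce(ctx, targets):
        ctx = F.normalize(ctx, dim=-1)
        targets = F.normalize(targets, dim=-1)
        logits = ctx @ targets.t() / temp   # other frames act as negatives
        return F.cross_entropy(logits, torch.arange(ctx.size(0)))
    # Standard task: predict own targets; switched task: predict the pair's,
    # forcing representations to agree across clean and noisy views.
    standard = nce(ctx_clean, q_clean) + nce(ctx_noisy, q_noisy)
    switched = nce(ctx_clean, q_noisy) + nce(ctx_noisy, q_clean)
    return standard + switched

T, D = 40, 128
loss = switched_contrastive(torch.randn(T, D), torch.randn(T, D),
                            torch.randn(T, D), torch.randn(T, D))
print(loss.item())
```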
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Diverse Knowledge Distillation for End-to-End Person Search [81.4926655119318]
Person search aims to localize and identify a specific person from a gallery of images.
Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches.
We propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck.
arXiv Detail & Related papers (2020-12-21T09:04:27Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
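A minimal sketch of gated feature fusion in this spirit; the paper's GRF uses recurrent gating, which this simplified per-frame version omits:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learn a per-frame gate that mixes noisy and enhanced features."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, noisy, enhanced):
        """noisy, enhanced: (T, dim) feature sequences of the same utterance."""
        g = torch.sigmoid(self.gate(torch.cat([noisy, enhanced], dim=-1)))
        # g ~ 1 keeps the raw noisy feature, g ~ 0 trusts the enhanced one.
        return g * noisy + (1.0 - g) * enhanced

fusion = GatedFusion(dim=80)  # e.g. 80-dim log-mel features
out = fusion(torch.randn(100, 80), torch.randn(100, 80))
print(out.shape)
```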
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- Incremental Learning for End-to-End Automatic Speech Recognition [41.297106772785206]
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR).
We design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions.
Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting.
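A hedged sketch of combining response-based distillation (match the old model's output distribution) with an explainability term; input-gradient saliency is our stand-in for the paper's attribution method, and the toy linear models stand in for real ASR networks:

```python
import torch
import torch.nn.functional as F

def saliency(model, x, target, create_graph=False):
    """Input-gradient attribution for one target class logit."""
    x = x.clone().requires_grad_(True)
    (grad,) = torch.autograd.grad(model(x)[0, target], x,
                                  create_graph=create_graph)
    return grad

def incremental_kd_loss(new_model, old_model, x, y, alpha=1.0, beta=1.0):
    ce = F.cross_entropy(new_model(x), y)       # loss on the new task
    with torch.no_grad():
        old_logits = old_model(x)
    # Response-based distillation: keep the old model's predictions.
    kd = F.kl_div(F.log_softmax(new_model(x), dim=-1),
                  F.softmax(old_logits, dim=-1), reduction="batchmean")
    # Explainability term: keep the old model's "reason" (attribution) too.
    target = old_logits.argmax(dim=-1)[0]
    expl = F.mse_loss(saliency(new_model, x, target, create_graph=True),
                      saliency(old_model, x, target).detach())
    return ce + alpha * kd + beta * expl

old_model, new_model = torch.nn.Linear(20, 5), torch.nn.Linear(20, 5)
x, y = torch.randn(1, 20), torch.tensor([2])
print(incremental_kd_loss(new_model, old_model, x, y).item())
```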
arXiv Detail & Related papers (2020-05-11T08:18:08Z)