Federated Learning for ASR based on Wav2vec 2.0
- URL: http://arxiv.org/abs/2302.10790v1
- Date: Mon, 20 Feb 2023 18:36:46 GMT
- Title: Federated Learning for ASR based on Wav2vec 2.0
- Authors: Tuan Nguyen, Salima Mdhaffar, Natalia Tomashenko, Jean-François
Bonastre, Yannick Estève
- Abstract summary: We study the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self-supervision.
Experiments show that such a model can obtain, with no use of a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set.
We also analyse the ASR performance for speakers depending on their participation in the federated learning.
- Score: 4.711492191554342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a study on the use of federated learning to train an ASR
model based on a wav2vec 2.0 model pre-trained by self-supervision. Carried out
on the well-known TED-LIUM 3 dataset, our experiments show that such a model
can obtain, with no use of a language model, a word error rate of 10.92% on the
official TED-LIUM 3 test set, without sharing any data between the different
users. We also analyse the ASR performance for speakers depending on their
participation in the federated learning. Since federated learning was first
introduced for privacy purposes, we also measure its ability to protect speaker
identity. To do so, we use an approach that analyzes the information contained
in the exchanged models, based on a neural network footprint computed on an
indicator dataset. This analysis is performed layer-wise and shows which layers
of an exchanged wav2vec 2.0 based model carry speaker identity information.
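The paper itself does not ship code here, but the training loop it describes rests on a standard federated averaging step: each speaker fine-tunes the shared wav2vec 2.0 model locally, and a server averages the returned weights. Below is a minimal sketch of that aggregation in PyTorch; the function name, the data-size weighting, and the use of plain state_dicts are illustrative assumptions, not the authors' implementation.

```python
# Minimal FedAvg-style aggregation sketch (illustrative, not the paper's code).
import torch

def fedavg(client_states, client_num_samples):
    """Weighted average of client state_dicts (e.g. locally fine-tuned
    wav2vec 2.0 ASR models), weighted by each client's local data size."""
    total = sum(client_num_samples)
    reference = client_states[0]
    avg_state = {}
    for key in reference:
        stacked = torch.stack([
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_num_samples)
        ])
        # cast back so the result can be loaded into the global model
        avg_state[key] = stacked.sum(dim=0).to(reference[key].dtype)
    return avg_state

# One communication round over three speakers' local models:
# new_global = fedavg([m.state_dict() for m in local_models],
#                     client_num_samples=[120, 300, 80])
# global_model.load_state_dict(new_global)
```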
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for auditory, visual, and audiovisual speech recognition (ASR, VSR, and AVSR) systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
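A greedy pseudo-labelling pass of the kind this summary mentions can be sketched as follows, assuming a CTC-style acoustic model; the confidence filter and helper names are illustrative guesses, not the paper's procedure.

```python
import torch

def greedy_ctc_pseudo_labels(model, unlabeled_audio, blank_id=0):
    """Greedy-decode unlabelled audio with a seed model; return
    (token_ids, mean_frame_confidence) pairs as pseudo-labels."""
    model.eval()
    with torch.no_grad():
        log_probs = model(unlabeled_audio).log_softmax(dim=-1)  # (T, B, V)
    conf, ids = log_probs.max(dim=-1)
    labels = []
    for b in range(ids.shape[1]):
        seq = ids[:, b]
        # CTC collapse: drop repeated tokens, then blanks
        tokens = [int(t) for i, t in enumerate(seq)
                  if (i == 0 or t != seq[i - 1]) and t != blank_id]
        labels.append((tokens, conf[:, b].exp().mean().item()))
    return labels

# Keep only confident pseudo-labels before adding them to the train set;
# the 0.9 threshold is illustrative.
# kept = [(y, c) for y, c in greedy_ctc_pseudo_labels(asr, batch) if c > 0.9]
```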
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Interpretable Temporal Class Activation Representation for Audio Spoofing Detection [7.476305130252989]
We utilize the wav2vec 2.0 model and attentive utterance-level features to integrate interpretability directly into the model's architecture.
Our model achieves state-of-the-art results, with an EER of 0.51% and a min t-DCF of 0.0165 on the ASVspoof 2019-LA set.
arXiv Detail & Related papers (2024-06-13T05:36:01Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Convolutional Neural Networks for the classification of glitches in gravitational-wave data streams [52.77024349608834]
We classify transient noise signals (i.e. glitches) and gravitational waves in data from the Advanced LIGO detectors.
We use models with a supervised learning approach, trained from scratch on the Gravity Spy dataset.
We also explore a self-supervised approach, pre-training models with automatically generated pseudo-labels.
arXiv Detail & Related papers (2023-03-24T11:12:37Z)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue for safeguarding user information used to train deep models, by imposing noisy distortion on private data.
In this work, we extend PATE learning to work with dynamic patterns, namely speech, and perform a first experimental study on ASR to avoid acoustic data leakage.
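PATE is usually defined for classification, so its core is a noisy vote over teacher predictions; extending it to sequential ASR outputs is the paper's contribution and is not reproduced here. A minimal sketch of the underlying noisy-max aggregation, with an illustrative Laplace noise scale `gamma`:

```python
import numpy as np

def pate_noisy_vote(teacher_preds, num_classes, gamma=0.1, rng=None):
    """PATE noisy-max: aggregate teacher label votes with Laplace noise
    so the student never sees any single teacher's (private) output."""
    rng = rng or np.random.default_rng()
    votes = np.bincount(teacher_preds, minlength=num_classes).astype(float)
    votes += rng.laplace(scale=1.0 / gamma, size=num_classes)
    return int(votes.argmax())

# Ten teachers vote on a label for one public query; smaller gamma
# means more noise (stronger privacy) but noisier student labels.
# label = pate_noisy_vote(np.array([3, 3, 2, 3, 1, 3, 3, 2, 3, 3]), num_classes=5)
```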
arXiv Detail & Related papers (2022-10-11T16:55:54Z)
- Training speaker recognition systems with limited data [2.3148470932285665]
This work considers training neural networks for speaker recognition with a much smaller dataset size compared to contemporary work.
We artificially restrict the amount of data by proposing three subsets of the popular VoxCeleb2 dataset.
We show that the self-supervised, pre-trained weights of wav2vec2 substantially improve performance when training data is limited.
arXiv Detail & Related papers (2022-03-28T12:41:41Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
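The intermediate-supervision idea can be sketched as an extra masked-prediction loss attached partway up the encoder stack; the layer index, separate prediction heads, and cross-entropy target below are illustrative assumptions, not the exact ILS-SSL objective.

```python
import torch
import torch.nn as nn

class IntermediateSupervision(nn.Module):
    """Sketch: apply the same masked-prediction SSL loss at an
    intermediate transformer layer and at the top layer."""
    def __init__(self, encoder_layers, hidden_dim, num_targets, mid_layer=4):
        super().__init__()
        self.layers = nn.ModuleList(encoder_layers)
        self.mid_layer = mid_layer
        self.mid_head = nn.Linear(hidden_dim, num_targets)
        self.top_head = nn.Linear(hidden_dim, num_targets)

    def forward(self, x, masked_targets, masked_idx):
        # x: (B, T, D) frame features; masked_idx: bool (B, T) mask of
        # masked frames; masked_targets: (N,) discrete SSL target ids
        ce = nn.CrossEntropyLoss()
        loss = 0.0
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.mid_layer:  # extra SSL loss on masked frames
                loss = loss + ce(self.mid_head(x[masked_idx]), masked_targets)
        return loss + ce(self.top_head(x[masked_idx]), masked_targets)
```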
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Privacy attacks for automatic speech recognition acoustic models in a federated learning framework [5.1229352884025845]
We propose an approach to analyze information in neural network acoustic models (AMs) based on a neural network footprint on the Indicator dataset.
Experiments on the TED-LIUM 3 corpus demonstrate that the proposed approaches are very effective and can achieve an equal error rate (EER) of 1-2%.
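The footprint attack itself is detailed in the cited paper; the layer-wise probing idea behind it can be sketched by pooling each layer's hidden states on an indicator set and scoring speaker pairs by cosine similarity. The pooling and scoring choices below (HuggingFace transformers API) are illustrative.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def layerwise_speaker_embeddings(waveform):
    """One mean-pooled embedding per layer, so each layer can be
    probed for how much speaker identity it carries."""
    with torch.no_grad():
        out = model(waveform, output_hidden_states=True)
    # out.hidden_states: tuple of (1, T, D) tensors, one per layer
    return [h.mean(dim=1).squeeze(0) for h in out.hidden_states]

# Per layer, score same- vs. different-speaker indicator pairs and
# compute an EER; layer-wise EER differences reveal where identity lives.
# sim = torch.cosine_similarity(emb_a[layer], emb_b[layer], dim=0)
```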
arXiv Detail & Related papers (2021-11-06T02:08:13Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
However, little has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% in the 1-second condition and an EER of 3.47% in the full-length condition of the AP17-OLR dataset.
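Several entries above report equal error rates; for reference, a minimal, standard EER computation over verification trial scores (not tied to any single paper) looks like this:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: operating point where the false-accept rate equals the
    false-reject rate. `labels` are 1 for target (same-speaker) trials."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    order = np.argsort(scores)[::-1]       # accept highest scores first
    labels = labels[order]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    far = np.cumsum(1 - labels) / n_non    # false accepts so far
    frr = 1.0 - np.cumsum(labels) / n_tar  # targets still rejected
    i = np.argmin(np.abs(far - frr))
    return float((far[i] + frr[i]) / 2.0)

# equal_error_rate([0.9, 0.8, 0.7, 0.3], [1, 1, 0, 0]) -> 0.0
# (perfectly separable scores give zero error)
```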
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.