Federated Representation Learning for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2308.02013v2
- Date: Mon, 7 Aug 2023 21:34:44 GMT
- Title: Federated Representation Learning for Automatic Speech Recognition
- Authors: Guruprasad V Ramesh, Gopinath Chennupati, Milind Rao, Anit Kumar Sahu,
Ariya Rastrow, Jasha Droppo
- Abstract summary: Federated Learning (FL) is a privacy-preserving paradigm, allowing edge devices to learn collaboratively without sharing data.
We bring Self-supervised Learning (SSL) and FL together to learn representations for Automatic Speech Recognition respecting data privacy constraints.
We show that the pre-trained ASR encoder in FL performs as well as a centrally pre-trained model and produces an improvement of 12-15% (WER) compared to no pre-training.
- Score: 20.641076546330986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Federated Learning (FL) is a privacy-preserving paradigm, allowing edge
devices to learn collaboratively without sharing data. Edge devices like Alexa
and Siri are prospective sources of unlabeled audio data that can be tapped to
learn robust audio representations. In this work, we bring Self-supervised
Learning (SSL) and FL together to learn representations for Automatic Speech
Recognition respecting data privacy constraints. We use the speaker and chapter
information in the unlabeled speech dataset, Libri-Light, to simulate non-IID
speaker-siloed data distributions and pre-train an LSTM encoder with the
Contrastive Predictive Coding framework with FedSGD. We show that the
pre-trained ASR encoder in FL performs as well as a centrally pre-trained model
and produces an improvement of 12-15% (WER) compared to no pre-training. We
further adapt the federated pre-trained models to a new language, French, and
show a 20% (WER) improvement over no pre-training.
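The training setup the abstract describes can be sketched compactly. Below is a minimal, illustrative PyTorch rendering of CPC pre-training of an LSTM encoder under FedSGD, where each client holds one speaker silo; the names (`CPCEncoder`, `fedsgd_round`, the linear frontend, all dimensions) are assumptions made for the sketch, not the authors' code.

```python
# Minimal sketch of federated CPC pre-training with FedSGD, as described in the
# abstract. Module and variable names (CPCEncoder, clients, horizon) are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CPCEncoder(nn.Module):
    """Frame encoder + LSTM context network + per-step future predictors."""
    def __init__(self, n_mels=40, latent=256, context=256, horizon=4):
        super().__init__()
        self.enc = nn.Linear(n_mels, latent)      # stand-in for a conv frontend
        self.lstm = nn.LSTM(latent, context, batch_first=True)
        self.pred = nn.ModuleList(nn.Linear(context, latent) for _ in range(horizon))
        self.horizon = horizon

    def cpc_loss(self, x):                        # x: (B, T, n_mels)
        z = self.enc(x)                           # latents  (B, T, latent)
        c, _ = self.lstm(z)                       # contexts (B, T, context)
        loss, T = 0.0, x.size(1)
        for k, w in enumerate(self.pred, start=1):
            pred = w(c[:, :T - k])                # predict z_{t+k} from c_t
            target = z[:, k:]
            # InfoNCE over the batch: other utterances serve as negatives
            logits = torch.einsum('btd,ntd->btn', pred, target)
            labels = torch.arange(x.size(0)).unsqueeze(1).expand(-1, T - k)
            loss = loss + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        return loss / self.horizon

def fedsgd_round(global_model, optimizer, client_batches):
    """One FedSGD round: average per-client gradients, take one global step."""
    optimizer.zero_grad()
    for batch in client_batches:                  # each client = one speaker silo
        # dividing by the client count makes gradient accumulation an average
        (global_model.cpc_loss(batch) / len(client_batches)).backward()
    optimizer.step()

model = CPCEncoder()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
# Simulated non-IID silos: one tensor of log-mel frames per speaker.
clients = [torch.randn(8, 100, 40) for _ in range(4)]
for _ in range(3):
    fedsgd_round(model, opt, clients)
```

In this simulation the gradient averaging happens implicitly through accumulation; in a real deployment each client would send its gradient to the server, which averages them before the single optimizer step.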
Related papers
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric w.r.t. the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- Split Federated Learning on Micro-controllers: A Keyword Spotting Showcase [1.4794135558227681]
Federated Learning is proposed as a private learning scheme in which users train the model locally, so that users' raw data is never collected on servers.
In this work, we implement a simple SFL framework on the Arduino board and verify its correctness on a Chinese digits audio dataset for a keyword spotting application, reaching over 90% accuracy.
On the English digits audio dataset, our SFL implementation achieves 13.89% higher accuracy compared to a state-of-the-art FL implementation.
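For illustration, here is a minimal sketch of the split-learning exchange that entry relies on, with toy dimensions and PyTorch standing in for both sides: the device-side layers emit activations ("smashed data"), the server completes the forward and backward pass, and only activations and gradients cross the boundary, never raw audio. All names and sizes are hypothetical.

```python
# Minimal sketch of split learning for keyword spotting: the device computes
# the first layers locally; the server finishes the forward/backward pass.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

device_part = nn.Sequential(nn.Linear(40, 64), nn.ReLU())   # runs on the MCU
server_part = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

opt = torch.optim.SGD([*device_part.parameters(), *server_part.parameters()], lr=0.1)

def split_step(features, labels):
    opt.zero_grad()
    smashed = device_part(features)              # activations sent to the server
    detached = smashed.detach().requires_grad_(True)
    loss = nn.functional.cross_entropy(server_part(detached), labels)
    loss.backward()                              # server backprops to the cut layer
    smashed.backward(detached.grad)              # device resumes from returned grads
    opt.step()
    return loss.item()

print(split_step(torch.randn(16, 40), torch.randint(0, 10, (16,))))
```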
arXiv Detail & Related papers (2022-10-04T23:42:45Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Semi-FedSER: Semi-supervised Learning for Speech Emotion Recognition On Federated Learning using Multiview Pseudo-Labeling [43.17379040854574]
Speech Emotion Recognition (SER) application is frequently associated with privacy concerns.
Federated learning (FL) is a distributed machine learning algorithm that coordinates clients to train a model collaboratively without sharing local data.
In this work, we propose a semi-supervised federated learning framework, Semi-FedSER, that utilizes both labeled and unlabeled data samples to address the challenge of limited data samples in FL.
arXiv Detail & Related papers (2022-03-15T21:50:43Z)
- Deciphering Speech: a Zero-Resource Approach to Cross-Lingual Transfer in ASR [13.726142328715897]
We present a method for cross-lingual training of an ASR system using absolutely no transcribed training data from the target language.
Our approach uses a novel application of a decipherment algorithm, which operates given only unpaired speech and text data from the target language.
arXiv Detail & Related papers (2021-11-12T16:16:46Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Federated Self-Training for Semi-Supervised Audio Recognition [0.23633885460047763]
In this work, we study the problem of semi-supervised learning of audio models via self-training.
We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models.
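As a rough illustration of federated self-training, the sketch below pseudo-labels unlabeled client audio with the current global model, keeps only confident predictions for local training, and averages the resulting client weights. The confidence threshold and all names are assumptions made for the sketch, not FedSTAR's exact recipe.

```python
# Minimal sketch of federated self-training on unlabeled client audio:
# pseudo-label with the global model, train on confident predictions locally,
# then average client weights. Threshold and names are illustrative.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def local_self_training(global_model, unlabeled, threshold=0.9, lr=0.01):
    model = copy.deepcopy(global_model)
    with torch.no_grad():
        probs = F.softmax(model(unlabeled), dim=-1)
        conf, pseudo = probs.max(dim=-1)
    keep = conf > threshold                      # train only on confident samples
    if keep.any():
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        opt.zero_grad()
        F.cross_entropy(model(unlabeled[keep]), pseudo[keep]).backward()
        opt.step()
    return model.state_dict()

def average_weights(states):
    return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

global_model = nn.Linear(40, 10)                 # stand-in audio classifier
client_states = [local_self_training(global_model, torch.randn(32, 40))
                 for _ in range(4)]
global_model.load_state_dict(average_weights(client_states))
```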
arXiv Detail & Related papers (2021-07-14T17:40:10Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.