Continual Learning for On-Device Speech Recognition using Disentangled
Conformers
- URL: http://arxiv.org/abs/2212.01393v1
- Date: Fri, 2 Dec 2022 18:58:51 GMT
- Title: Continual Learning for On-Device Speech Recognition using Disentangled
Conformers
- Authors: Anuj Diwan, Ching-Feng Yeh, Wei-Ning Hsu, Paden Tomasello, Eunsol
Choi, David Harwath, Abdelrahman Mohamed
- Abstract summary: We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
- Score: 54.32320258055716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition research focuses on training and evaluating on
static datasets. Yet, as speech models are increasingly deployed on personal
devices, such models encounter user-specific distributional shifts. To simulate
this real-world scenario, we introduce LibriContinual, a continual learning
benchmark for speaker-specific domain adaptation derived from LibriVox
audiobooks, with data corresponding to 118 individual speakers and 6 train
splits per speaker of different sizes. Additionally, current speech recognition
models and continual learning algorithms are not optimized to be
compute-efficient. We adapt a general-purpose training algorithm NetAug for ASR
and create a novel Conformer variant called the DisConformer (Disentangled
Conformer). This algorithm produces ASR models consisting of a frozen 'core'
network for general-purpose use and several tunable 'augment' networks for
speaker-specific tuning. Using such models, we propose a novel
compute-efficient continual learning algorithm called DisentangledCL. Our
experiments show that the DisConformer models significantly outperform
baselines on general ASR, i.e., LibriSpeech (by 15.58% rel. WER on test-other). On
speaker-specific LibriContinual they significantly outperform
trainable-parameter-matched baselines (by 20.65% rel. WER on test) and even
match fully finetuned baselines in some settings.
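The frozen 'core' plus tunable 'augment' split described in the abstract can be illustrated with a toy sketch. This is not the DisConformer or DisentangledCL: it assumes a plain linear model whose weights are the sum of a core vector and an augment vector, and the names `TinyDisentangledModel` and `adapt_step` are hypothetical. It only shows the idea that speaker-specific adaptation updates the augment parameters while the core stays untouched.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TinyDisentangledModel:
    """Toy linear model y = sum((core[i] + augment[i]) * x[i]).

    Hypothetical stand-in for a frozen 'core' network and a tunable
    'augment' network; not the architecture from the paper.
    """
    core: List[float]
    augment: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start with a zero augment so the model initially equals the core.
        if not self.augment:
            self.augment = [0.0] * len(self.core)

    def predict(self, x: List[float]) -> float:
        return sum((c + a) * xi for c, a, xi in zip(self.core, self.augment, x))

    def adapt_step(self, x: List[float], target: float, lr: float = 0.1) -> float:
        """One SGD step on 0.5 * error^2 that updates only the augment weights."""
        error = self.predict(x) - target
        # d(0.5 * error^2) / d(augment[i]) = error * x[i]; core is never modified.
        self.augment = [a - lr * error * xi for a, xi in zip(self.augment, x)]
        return 0.5 * error ** 2


# Adapt to a single made-up "speaker-specific" example.
model = TinyDisentangledModel(core=[1.0, -2.0])
core_before = list(model.core)
for _ in range(100):
    model.adapt_step([1.0, 1.0], target=3.0)
```

After adaptation the core weights are unchanged, so the general-purpose behaviour can be recovered by resetting the augment weights to zero.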
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Disentangling Voice and Content with Self-Supervision for Speaker
Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
arXiv Detail & Related papers (2023-08-28T21:25:05Z)
- Training Robust Zero-Shot Voice Conversion Models with Self-supervised
Features [24.182732872327183]
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker.
We show that high-quality audio samples can be achieved by using a length resampling decoder.
arXiv Detail & Related papers (2021-12-08T17:27:39Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global
Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions.
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Streaming end-to-end multi-talker speech recognition [34.76106500736099]
We propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker speech recognition.
Our model employs the Recurrent Neural Network Transducer (RNN-T) as the backbone that can meet various latency constraints.
Based on experiments on the publicly available LibriSpeechMix dataset, we show that Heuristic Error Assignment Training (HEAT) can achieve better accuracy compared with Permutation Invariant Training (PIT).
arXiv Detail & Related papers (2020-11-26T06:28:04Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)