Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization
- URL: http://arxiv.org/abs/2005.14578v1
- Date: Fri, 29 May 2020 13:58:36 GMT
- Title: Improving Unsupervised Sparsespeech Acoustic Models with Categorical Reparameterization
- Authors: Benjamin Milde, Chris Biemann
- Abstract summary: We extend the Sparsespeech model to allow for sampling over a random discrete variable, yielding pseudo-posteriorgrams.
The new and improved model is trained and evaluated on the Libri-Light corpus, a benchmark for ASR with limited or no supervision.
We observe a relative improvement of up to 31.4% on ABX error rates across speakers on the test set with the improved model.
- Score: 31.977418525076626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Sparsespeech model is an unsupervised acoustic model that can generate
discrete pseudo-labels for untranscribed speech. We extend the Sparsespeech
model to allow for sampling over a random discrete variable, yielding
pseudo-posteriorgrams. The degree of sparsity in this posteriorgram can be
fully controlled after the model has been trained. We use the Gumbel-Softmax
trick to approximately sample from a discrete distribution in the neural
network and this allows us to train the network efficiently with standard
backpropagation. The new and improved model is trained and evaluated on the
Libri-Light corpus, a benchmark for ASR with limited or no supervision. The
model is trained on 600h and 6000h of English read speech. We evaluate the
improved model using the ABX error measure and a semi-supervised setting with
10h of transcribed speech. We observe a relative improvement of up to 31.4% on
ABX error rates across speakers on the test set with the improved Sparsespeech
model on 600h of speech data and further improvements when we scale the model
to 6000h.
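The Gumbel-Softmax step admits a compact illustration. Below is a minimal PyTorch sketch of the general trick, not the authors' Sparsespeech code: Gumbel noise is added to the network's logits and a temperature-scaled softmax yields a differentiable pseudo-posteriorgram, so standard backpropagation applies. All shapes and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Differentiable approximate sample from Categorical(softmax(logits)).

    Low temperatures give near-one-hot (sparse) outputs; high temperatures
    give smooth, near-uniform ones.
    """
    # Gumbel(0, 1) noise via inverse transform: -log(-log(U)), U ~ Uniform(0, 1)
    uniform = torch.rand_like(logits)
    gumbel = -torch.log(-torch.log(uniform + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / temperature, dim=-1)

# Hypothetical shapes: 2 utterances, 100 frames, 50 discrete pseudo-label units.
logits = torch.randn(2, 100, 50, requires_grad=True)
pseudo_posteriorgram = gumbel_softmax_sample(logits, temperature=0.5)
pseudo_posteriorgram.sum().backward()  # gradients flow back to the logits
```

Since the temperature enters only at sampling time, one plausible reading of "the degree of sparsity can be fully controlled after the model has been trained" is that it can be varied freely at inference; the paper's exact mechanism may differ.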
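The ABX evaluation also has a simple core: given tokens A and X of the same category (e.g., the same triphone, here across speakers) and a token B of a different category, an error is counted whenever X is not closer to A than to B. The sketch below uses single-vector embeddings and cosine distance for brevity; the actual Libri-Light/ZeroSpeech tooling aggregates frame-wise distances along a DTW alignment, so treat this as an illustration only.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def abx_error_rate(triplets) -> float:
    """Fraction of (A, B, X) triplets, with A and X from the same category,
    in which X is not strictly closer to A than to B."""
    errors = sum(cosine_distance(a, x) >= cosine_distance(b, x)
                 for a, b, x in triplets)
    return errors / len(triplets)

# Hypothetical embeddings: x is a noisy copy of a (same unit), b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=16)
x = a + 0.1 * rng.normal(size=16)
b = rng.normal(size=16)
print(abx_error_rate([(a, b, x)]))  # prints 0.0: x stays closer to a than to b
```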
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling [75.74809713084282]
Distil-Whisper is designed to be paired with Whisper for speculative decoding, yielding a 2 times speed-up.
Distil-Whisper is 5.8 times faster with 51% fewer parameters, while performing to within 1% WER on out-of-distribution test data.
To facilitate further research in this domain, we make our training code, inference code and models publicly accessible.
arXiv Detail & Related papers (2023-11-01T10:45:07Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones (see the sketch after this list).
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Robust Speech Recognition via Large-Scale Weak Supervision [69.63329359286419]
We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet.
When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks.
We are releasing models and inference code to serve as a foundation for further work on robust speech processing.
arXiv Detail & Related papers (2022-12-06T18:46:04Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Prediction of speech intelligibility with DNN-based performance measures [9.883633991083789]
This paper presents a speech intelligibility model based on automatic speech recognition (ASR).
It combines phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these probabilities.
The proposed model performs almost as well as the label-based model and produces more accurate predictions than the baseline models.
arXiv Detail & Related papers (2022-03-17T08:05:38Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
- Scaling ASR Improves Zero and Few Shot Learning [23.896440724468246]
We propose data selection techniques to efficiently scale training data to find the most valuable samples in massive datasets.
By training 1-10B parameter universal English ASR models, we push the limits of speech recognition performance across many domains.
For speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement on the AphasiaBank test set, respectively.
arXiv Detail & Related papers (2021-11-10T21:18:59Z)
- Personalized Speech Enhancement through Self-Supervised Data Augmentation and Purification [24.596224536399326]
We train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources.
We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data.
arXiv Detail & Related papers (2021-04-05T17:17:55Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Attention based on-device streaming speech recognition with large speech corpus [16.702653972113023]
We present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained with large (> 10K hours) corpus.
We attained a word recognition rate of around 90% on the general domain, mainly by jointly training with connectionist temporal classification (CTC) and cross-entropy (CE) losses (see the joint-loss sketch after this list).
For on-demand adaptation, we fused the MoChA models with statistical n-gram models, achieving an average relative improvement of 36% in word error rate (WER) for target domains, including the general domain.
arXiv Detail & Related papers (2020-01-02T04:24:44Z)
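As referenced above, here is a sketch of the warped-to-de-warped reconstruction pretext task from the unsupervised TTS pre-training entry. The random monotonic warp, the linear interpolation, and all shapes are assumptions for illustration; the paper defines its own warping scheme and model.

```python
import numpy as np

def random_time_warp(mel: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Resample the time axis of a (frames, n_mels) mel-spectrogram
    along a random monotonic warp, using linear interpolation."""
    n_frames = mel.shape[0]
    # Random monotonic mapping from output frame index to input position.
    steps = rng.uniform(0.5, 1.5, size=n_frames)
    positions = np.cumsum(steps)
    positions = (positions - positions[0]) / (positions[-1] - positions[0]) * (n_frames - 1)
    warped = np.empty_like(mel)
    for m in range(mel.shape[1]):
        warped[:, m] = np.interp(positions, np.arange(n_frames), mel[:, m])
    return warped

# Pretext task: a sequence-to-sequence model is trained to reconstruct `mel`
# from `warped` (e.g., with an L1 loss), requiring no transcriptions at all.
rng = np.random.default_rng(0)
mel = rng.normal(size=(200, 80))   # hypothetical 200-frame, 80-bin spectrogram
warped = random_time_warp(mel, rng)
```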
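Likewise, the joint CTC and cross-entropy training from the MoChA on-device ASR entry is commonly realized as a weighted sum of a sequence-level CTC loss and a frame-level CE loss. A hedged PyTorch sketch: the shapes, the blank index, and the 0.5 weight are assumptions, not values from that paper.

```python
import torch
import torch.nn.functional as F

def joint_ctc_ce_loss(log_probs, targets, input_lengths, target_lengths,
                      frame_logits, frame_labels, ctc_weight=0.5):
    """Weighted sum of sequence-level CTC loss and frame-level CE loss."""
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
    # frame_logits: (batch, frames, classes); CE expects (batch, classes, frames).
    ce = F.cross_entropy(frame_logits.transpose(1, 2), frame_labels)
    return ctc_weight * ctc + (1.0 - ctc_weight) * ce

# Hypothetical toy batch: 2 utterances, 50 frames, 30 symbols, 10 labels each.
T, N, C, L = 50, 2, 30, 10
log_probs = torch.randn(T, N, C).log_softmax(-1)         # CTC input: (T, N, C)
targets = torch.randint(1, C, (N, L))                    # label sequences (0 = blank)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), L, dtype=torch.long)
frame_logits = torch.randn(N, T, C, requires_grad=True)  # per-frame scores
frame_labels = torch.randint(0, C, (N, T))               # assumed frame alignments
loss = joint_ctc_ce_loss(log_probs, targets, input_lengths, target_lengths,
                         frame_logits, frame_labels)
```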