Unsupervised Speech Recognition
- URL: http://arxiv.org/abs/2105.11084v1
- Date: Mon, 24 May 2021 04:10:47 GMT
- Title: Unsupervised Speech Recognition
- Authors: Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli
- Abstract summary: wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
- Score: 55.864459085947345
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite rapid progress in the recent past, current speech recognition systems
still require labeled training data which limits this technology to a small
fraction of the languages spoken around the globe. This paper describes
wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition
models without any labeled data. We leverage self-supervised speech
representations to segment unlabeled audio and learn a mapping from these
representations to phonemes via adversarial training. The right representations
are key to the success of our method. Compared to the best previous
unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT
benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark,
wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the
best published systems trained on 960 hours of labeled data from only two years
ago. We also experiment on nine other languages, including low-resource
languages such as Kyrgyz, Swahili and Tatar.
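The abstract's core technical step is an adversarial mapping from segment-level speech representations to phoneme distributions. The snippet below is a minimal PyTorch sketch of that idea, not the authors' released implementation: the feature dimension, phoneme inventory size, network shapes, and training loop are all simplifying assumptions made for illustration. A small convolutional generator turns pooled segment features into phoneme distributions, and a convolutional discriminator is trained to distinguish them from distributions derived from unpaired, phonemized text.
```python
# Minimal sketch of adversarial phoneme mapping in the spirit of wav2vec-U.
# All dimensions, architectures and hyperparameters are assumptions for
# illustration; see the paper / fairseq for the actual method.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_PHONEMES = 44   # assumed phoneme inventory size
FEAT_DIM = 512      # assumed size of pooled self-supervised segment features

class Generator(nn.Module):
    """Maps a sequence of segment representations to phoneme distributions."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Conv1d(FEAT_DIM, NUM_PHONEMES, kernel_size=4, padding=2)

    def forward(self, segment_feats):          # (batch, time, FEAT_DIM)
        logits = self.proj(segment_feats.transpose(1, 2)).transpose(1, 2)
        return F.softmax(logits, dim=-1)       # (batch, time', NUM_PHONEMES)

class Discriminator(nn.Module):
    """Scores whether a phoneme-distribution sequence looks like real text."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(NUM_PHONEMES, 256, kernel_size=6, padding=2), nn.GELU(),
            nn.Conv1d(256, 1, kernel_size=6, padding=2),
        )

    def forward(self, phoneme_probs):          # (batch, time, NUM_PHONEMES)
        return self.net(phoneme_probs.transpose(1, 2)).mean(dim=-1)  # (batch, 1)

generator, discriminator = Generator(), Discriminator()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)

def train_step(segment_feats, real_phoneme_onehot):
    """One adversarial update from unpaired audio segments and phonemized text.

    real_phoneme_onehot: float one-hot phoneme sequences from unpaired text,
    shaped (batch, text_len, NUM_PHONEMES).
    """
    # Discriminator update: real phonemized text vs. detached generator output.
    fake = generator(segment_feats).detach()
    d_loss = (
        F.binary_cross_entropy_with_logits(
            discriminator(real_phoneme_onehot),
            torch.ones(real_phoneme_onehot.size(0), 1))
        + F.binary_cross_entropy_with_logits(
            discriminator(fake), torch.zeros(fake.size(0), 1)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: produce phoneme sequences the discriminator accepts.
    fake = generator(segment_feats)
    g_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(fake.size(0), 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```
In the paper, the segment representations come from self-supervised features pooled over automatically derived segments; the sketch assumes such features are already available as `segment_feats`.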
Related papers
- Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis [30.97784092953007]
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition.
TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions.
This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition.
arXiv Detail & Related papers (2024-07-04T16:42:24Z)
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech, covering Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
In settings with limited training data, finetuning self-supervised representations is a better-performing and viable solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition.
This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve a word error rate below 20% for some languages.
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Applying Wav2vec2.0 to Speech Recognition in Various Low-resource Languages [16.001329145018687]
In the speech domain, wav2vec2.0 has begun to show its powerful representation ability and the feasibility of ultra-low-resource speech recognition on the Librispeech corpus.
However, wav2vec2.0 has not been examined in real spoken scenarios or on languages other than English.
We apply pre-trained models to solve low-resource speech recognition tasks in various spoken languages.
arXiv Detail & Related papers (2020-12-22T15:59:44Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned; a minimal sketch of this contrastive objective follows this list.
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
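The wav2vec 2.0 entry above describes masking latent speech representations and solving a contrastive task over their quantization. The sketch below illustrates that contrastive objective only; tensor shapes, the temperature, and the distractor-sampling convention are assumptions for illustration, not the fairseq implementation.
```python
# Illustrative contrastive objective in the spirit of wav2vec 2.0:
# identify the true quantized latent among sampled distractors.
import torch
import torch.nn.functional as F

def contrastive_loss(context_at_masked, true_quantized, distractors, temperature=0.1):
    """
    context_at_masked: (num_masked, dim) transformer outputs at masked timesteps
    true_quantized:    (num_masked, dim) quantized latents at those timesteps
    distractors:       (num_masked, num_negatives, dim) quantized latents sampled
                       from other masked timesteps of the same utterance
    """
    # Candidate set per masked position: true latent (index 0) plus distractors.
    candidates = torch.cat([true_quantized.unsqueeze(1), distractors], dim=1)
    sims = F.cosine_similarity(
        context_at_masked.unsqueeze(1), candidates, dim=-1) / temperature
    targets = torch.zeros(sims.size(0), dtype=torch.long)  # true latent at index 0
    return F.cross_entropy(sims, targets)
```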
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.