wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations
- URL: http://arxiv.org/abs/2006.11477v3
- Date: Thu, 22 Oct 2020 06:09:10 GMT
- Title: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech
Representations
- Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
- Abstract summary: We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
- Score: 51.25118580050847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We show for the first time that learning powerful representations from speech
audio alone followed by fine-tuning on transcribed speech can outperform the
best semi-supervised methods while being conceptually simpler. wav2vec 2.0
masks the speech input in the latent space and solves a contrastive task
defined over a quantization of the latent representations which are jointly
learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER
on the clean/other test sets. When lowering the amount of labeled data to one
hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour
subset while using 100 times less labeled data. Using just ten minutes of
labeled data and pre-training on 53k hours of unlabeled data still achieves
4.8/8.2 WER. This demonstrates the feasibility of speech recognition with
limited amounts of labeled data.
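The masked contrastive objective the abstract describes can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch version: context-network outputs at masked time steps are scored against the true quantized latent plus distractors sampled from other masked positions. Tensor shapes, the temperature, and the distractor count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(context, quantized, mask, num_distractors=100, temperature=0.1):
    """Masked contrastive task in the style of wav2vec 2.0 (illustrative).

    context:   (B, T, D) context network outputs
    quantized: (B, T, D) quantized latent targets (jointly learned)
    mask:      (B, T) bool tensor, True where the input was masked
    """
    c = context[mask]    # (N, D) predictions at masked time steps
    q = quantized[mask]  # (N, D) true targets at those steps
    n = q.size(0)

    # Sample distractors from the other masked positions (a real
    # implementation excludes the positive itself; omitted for brevity).
    idx = torch.randint(0, n, (n, num_distractors), device=q.device)
    candidates = torch.cat([q.unsqueeze(1), q[idx]], dim=1)  # (N, K+1, D)

    # Cosine similarity between each prediction and its candidate set.
    logits = F.cosine_similarity(c.unsqueeze(1), candidates, dim=-1) / temperature

    # The true target always sits at index 0 of the candidate set.
    labels = torch.zeros(n, dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```

In the paper this loss is combined with a codebook diversity penalty that encourages even use of the quantization codebook; that term is omitted here.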
Related papers
- Efficient Self-supervised Learning with Contextualized Target
Representations for Vision, Speech and Language [60.12197397018094]
data2vec is a learning objective that generalizes across several modalities.
For efficiency, we do not encode masked tokens, use a fast convolutional decoder, and amortize the effort of building teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x less pre-training time.
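As a rough illustration of those three ideas (not the authors' code), the sketch below uses an EMA-updated teacher whose targets are built once and reused, a student that encodes only unmasked positions, and an assumed convolutional decoder. The module interfaces, decay value, and regression loss are all assumptions.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # The teacher's weights track an exponential moving average of the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1.0 - decay)

def data2vec2_step(student, decoder, teacher, x, mask):
    # x: (B, T, D_in) input features; mask: (T,) bool, shared across the batch.
    with torch.no_grad():
        targets = teacher(x)        # (B, T, D): built once, reusable across
                                    # several masked versions of the sample
    visible = student(x[:, ~mask])  # masked tokens are never encoded
    pred = decoder(visible, mask)   # conv decoder predicts all T time steps
    return torch.mean((pred[:, mask] - targets[:, mask]) ** 2)
```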
arXiv Detail & Related papers (2022-12-14T22:13:11Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
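Reusing the contrastive_loss sketch from the abstract above, the switching idea might look like the following hypothetical training step, where the clean view predicts the noisy view's quantized targets and vice versa:

```python
def wav2vec_switch_step(model, clean, noisy, mask):
    """Illustrative target swap for an original/noisy utterance pair.

    `model` is assumed to return (context, quantized) per utterance,
    as in the contrastive_loss sketch above; both views share one mask.
    """
    c_clean, q_clean = model(clean, mask)
    c_noisy, q_noisy = model(noisy, mask)

    # Standard contrastive terms: each view predicts its own targets.
    loss = contrastive_loss(c_clean, q_clean, mask)
    loss = loss + contrastive_loss(c_noisy, q_noisy, mask)

    # Switched terms: the clean context must predict the noisy targets
    # and vice versa, pushing representations to be noise-invariant.
    loss = loss + contrastive_loss(c_clean, q_noisy, mask)
    loss = loss + contrastive_loss(c_noisy, q_clean, mask)
    return loss
```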
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
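As a rough sketch of such adaptation, one can pool the pre-trained representations and train a small task-specific head. This uses the Hugging Face transformers implementation of wav2vec 2.0 rather than the paper's code, and the checkpoint name, mean pooling, and class count are illustrative choices.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # third-party implementation

class VoiceTaskModel(nn.Module):
    """Pre-trained wav2vec 2.0 encoder plus a small task head (illustrative).

    num_classes is task-dependent (e.g. a keyword set size); freezing the
    encoder is one common adaptation regime, not the paper's only setup.
    """
    def __init__(self, num_classes: int, freeze_encoder: bool = True):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        if freeze_encoder:
            self.encoder.requires_grad_(False)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # (B, samples) raw audio -> (B, T, D) contextual representations
        hidden = self.encoder(waveform).last_hidden_state
        # Mean-pool over time, then classify.
        return self.classifier(hidden.mean(dim=1))
```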
arXiv Detail & Related papers (2021-10-03T19:28:57Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems from only two years ago that were trained on 960 hours of labeled data.
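A heavily simplified, hypothetical sketch of that adversarial setup: a generator maps segment features to phoneme distributions, and a discriminator learns to tell them apart from unpaired phonemized text. All dimensions and module choices below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

D_FEAT, N_PHONES = 512, 40  # assumed dims: segment features, phoneme inventory

generator = nn.Linear(D_FEAT, N_PHONES)  # segment features -> phoneme logits
discriminator = nn.Sequential(           # scores real vs. generated sequences
    nn.Conv1d(N_PHONES, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(64, 1),
)

def gan_losses(segments, real_phone_onehots):
    # segments:           (B, T, D_FEAT) pooled self-supervised features
    # real_phone_onehots: (B, T, N_PHONES) from unpaired phonemized text
    fake = torch.softmax(generator(segments), dim=-1)

    score_fake = discriminator(fake.transpose(1, 2))
    score_real = discriminator(real_phone_onehots.transpose(1, 2))

    bce = nn.functional.binary_cross_entropy_with_logits
    d_loss = bce(score_real, torch.ones_like(score_real)) + \
             bce(score_fake.detach(), torch.zeros_like(score_fake))
    g_loss = bce(score_fake, torch.ones_like(score_fake))  # fool the critic
    return d_loss, g_loss
```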
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Wav2vec-C: A Self-supervised Model for Speech Representation Learning [40.47940210640496]
Wav2vec-C is a representation learning technique combining elements from wav2vec 2.0 and VQ-VAE.
The proposed self-supervised model is trained on 10k hours of unlabeled data and fine-tuned with 1k hours of labeled data.
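The VQ-VAE ingredient can be sketched as nearest-neighbour quantization with a straight-through gradient plus a reconstruction (consistency) term added on top of the contrastive loss. The decoder and tensor shapes below are assumptions, not the paper's architecture.

```python
import torch
import torch.nn.functional as F

def vq_nearest(z, codebook):
    """Nearest-neighbour vector quantization (VQ-VAE style, illustrative).

    z:        (N, D) encoder latents
    codebook: (K, D) learned code vectors
    """
    dists = torch.cdist(z, codebook)  # (N, K) pairwise distances
    codes = dists.argmin(dim=-1)      # (N,) nearest code index
    q = codebook[codes]
    # Straight-through estimator: gradients flow through z, not argmin.
    return z + (q - z).detach(), codes

def consistency_loss(z, codebook, decoder, features):
    # Wav2vec-C style idea: the quantized codes should allow reconstructing
    # the original input features, in addition to the contrastive objective.
    q, _ = vq_nearest(z, codebook)
    recon = decoder(q)  # assumed small decoder network
    return F.mse_loss(recon, features)
```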
arXiv Detail & Related papers (2021-03-09T16:44:45Z)
- Exploring wav2vec 2.0 on speaker verification and language
identification [9.047596226273495]
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61%, on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset.
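For intuition, verification on top of such representations typically reduces to pooling an utterance embedding and cosine-scoring trial pairs; the reported EER is the operating point where false accepts equal false rejects as the decision threshold is swept. The sketch below, with an assumed encoder interface and threshold, shows only that scoring step, not the authors' full pipeline.

```python
import torch
import torch.nn.functional as F

def utterance_embedding(encoder, waveform: torch.Tensor) -> torch.Tensor:
    # (1, samples) raw audio -> (D,) fixed-size speaker embedding
    hidden = encoder(waveform)      # assumed to return (1, T, D)
    return hidden.mean(dim=1).squeeze(0)

def verify(encoder, enroll_wav, test_wav, threshold=0.7):
    """Accept the trial if the cosine score clears an (assumed) threshold."""
    e = utterance_embedding(encoder, enroll_wav)
    t = utterance_embedding(encoder, test_wav)
    score = F.cosine_similarity(e, t, dim=0).item()
    return score >= threshold, score
```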
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- Self-training and Pre-training are Complementary for Speech Recognition [64.85342993297677]
Self-training and unsupervised pre-training have emerged as effective approaches to improve speech recognition systems using unlabeled data.
We show that pseudo-labeling and pre-training with wav2vec 2.0 are complementary in a variety of labeled data setups.
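The combined recipe reduces to a simple loop. In the hypothetical sketch below, fine_tune and transcribe are caller-supplied helpers rather than an API from either paper:

```python
def self_training(pretrained, labeled, unlabeled, fine_tune, transcribe, rounds=2):
    """Illustrative pseudo-labeling on top of a pre-trained model."""
    # 1. Fine-tune the pre-trained (e.g. wav2vec 2.0) model on real labels.
    model = fine_tune(pretrained, labeled)
    for _ in range(rounds):
        # 2. Transcribe the unlabeled audio with the current model.
        pseudo = [(wav, transcribe(model, wav)) for wav in unlabeled]
        # 3. Retrain from the pre-trained weights on real plus pseudo labels;
        #    the paper's point is that the two sources of gain stack.
        model = fine_tune(pretrained, labeled + pseudo)
    return model
```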
arXiv Detail & Related papers (2020-10-22T04:15:37Z)
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech
Recognition [97.44056170380726]
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training.
We achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, compared to the current state-of-the-art WERs of 1.7%/3.3%.
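SpecAugment itself is easy to state in code: random time and frequency bands of the spectrogram are zeroed during training. The mask counts and widths below are assumptions, not the paper's exact augmentation policy.

```python
import torch

def spec_augment(spec, num_time_masks=2, time_width=40,
                 num_freq_masks=2, freq_width=15):
    """Zero out random time and frequency bands of a spectrogram.

    spec: (F, T) log-mel spectrogram; mask counts and widths are assumed.
    """
    spec = spec.clone()
    n_freq, n_time = spec.shape
    for _ in range(num_freq_masks):
        w = torch.randint(0, freq_width + 1, ()).item()
        f0 = torch.randint(0, max(1, n_freq - w), ()).item()
        spec[f0:f0 + w, :] = 0.0   # frequency mask
    for _ in range(num_time_masks):
        w = torch.randint(0, time_width + 1, ()).item()
        t0 = torch.randint(0, max(1, n_time - w), ()).item()
        spec[:, t0:t0 + w] = 0.0   # time mask
    return spec
```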
arXiv Detail & Related papers (2020-10-20T17:58:13Z)