Wav2vec-C: A Self-supervised Model for Speech Representation Learning
- URL: http://arxiv.org/abs/2103.08393v1
- Date: Tue, 9 Mar 2021 16:44:45 GMT
- Title: Wav2vec-C: A Self-supervised Model for Speech Representation Learning
- Authors: Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu,
Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas
- Abstract summary: Wav2vec-C is a representation learning technique combining elements from wav2vec 2.0 and VQ-VAE.
The proposed self-supervised model is trained on 10k hours of unlabeled data and fine-tuned with 1k hours of labeled data.
- Score: 40.47940210640496
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wav2vec-C introduces a novel representation learning technique combining
elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized
representations from partially masked speech encoding using a contrastive loss
in a way similar to Wav2vec 2.0. However, the quantization process is
regularized by an additional consistency network that learns to reconstruct the
input features to the wav2vec 2.0 network from the quantized representations in
a way similar to a VQ-VAE model. The proposed self-supervised model is trained
on 10k hours of unlabeled data and subsequently used as the speech encoder in an
RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one
of only a few studies of self-supervised learning on speech tasks with a large
volume of real far-field labeled data. The Wav2vec-C encoded representations
achieve, on average, twice the error reduction over the baseline and higher
codebook utilization than wav2vec 2.0.
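The combination of the two objectives can be sketched in a few lines. The following PyTorch snippet is an illustrative sketch only, not the authors' implementation; the module names (encoder, quantizer, context_net, consistency_net), the shapes, and the simplified full-softmax negative sampling are all assumptions:
```python
import torch
import torch.nn.functional as F

def wav2vec_c_loss(x, encoder, quantizer, context_net, consistency_net,
                   mask, temperature=0.1, gamma=1.0):
    """Illustrative combined objective, not the authors' code.

    x:    (B, T, F) input speech features
    mask: (B, T) boolean, True at masked time steps
    """
    z = encoder(x)                    # (B, T, D) latent encoding
    q, vq_loss = quantizer(z)         # quantized targets + codebook loss
    # Simplification: masked steps are zeroed here; wav2vec 2.0 uses a
    # learned mask embedding instead.
    c = context_net(z.masked_fill(mask.unsqueeze(-1), 0.0))

    # Contrastive term (wav2vec 2.0 style, simplified): each masked
    # prediction c_t must identify its own quantized target q_t among
    # the quantized vectors at all other time steps of the utterance.
    logits = F.cosine_similarity(c.unsqueeze(2), q.unsqueeze(1),
                                 dim=-1) / temperature        # (B, T, T)
    targets = torch.arange(logits.size(1), device=x.device)
    targets = targets.unsqueeze(0).expand(logits.size(0), -1)  # (B, T)
    contrastive = F.cross_entropy(logits[mask], targets[mask])

    # Consistency term (VQ-VAE style): reconstruct the input features
    # to the wav2vec 2.0 network from the quantized representations.
    consistency = F.mse_loss(consistency_net(q), x)

    return contrastive + vq_loss + gamma * consistency
```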
Related papers
- Efficient Self-supervised Learning with Contextualized Target
Representations for Vision, Speech and Language [60.12197397018094]
data2vec is a learning objective that generalizes across several modalities.
It does not encode masked tokens, uses a fast convolutional decoder, and amortizes the effort to build teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time.
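A minimal sketch of the contextualized-target idea, assuming an EMA teacher that encodes the unmasked input once and a student that regresses the teacher's states at masked positions; all names are illustrative, and the loss is simplified (data2vec actually averages several top teacher layers):
```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Teacher weights track an exponential moving average of the student.
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

def data2vec_step(student, decoder, teacher, x, masks):
    # Build contextualized targets once from the *unmasked* input ...
    with torch.no_grad():
        targets = teacher(x)                     # (B, T, D)
    # ... and amortize them over several masked versions of the sample.
    loss = 0.0
    for mask in masks:                           # list of (B, T) bools
        h = student(x.masked_fill(mask.unsqueeze(-1), 0.0))
        pred = decoder(h)                        # fast conv decoder
        loss = loss + ((pred - targets) ** 2)[mask].mean()
    return loss / len(masks)
```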
arXiv Detail & Related papers (2022-12-14T22:13:11Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose stochastic compression for on-demand compute reduction in wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on the 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that, using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
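One way to picture on-demand compute is a temporal pooling factor sampled during training and chosen freely at inference. The sketch below is a loose illustration under that assumption only; the paper's actual stochastic compression mechanism may differ:
```python
import random
import torch.nn.functional as F

def forward_with_compression(features, transformer, factor):
    # features: (B, T, D); average-pool the time axis by `factor`,
    # so the transformer sees fewer frames and runs faster.
    pooled = F.avg_pool1d(features.transpose(1, 2), factor).transpose(1, 2)
    return transformer(pooled)

def training_step(features, transformer, factors=(1, 2, 3, 4)):
    # Sample a compression level per batch during training so one model
    # tolerates all of them; at inference, pick the level that matches
    # the available compute budget.
    return forward_with_compression(features, transformer,
                                    random.choice(factors))
```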
arXiv Detail & Related papers (2022-04-25T19:25:46Z) - Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired
Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can reduce the word error rate (WER) by 19.2% relative over the method without decoder pre-training.
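A rough sketch of the pseudo-code idea: discrete acoustic units are derived offline (here, hypothetically, by k-means over speech features) and stand in for text, so the decoder can be pre-trained on unpaired speech. The decoder signature below is assumed for illustration:
```python
import torch
import torch.nn.functional as F

def pseudo_codes(features, centroids):
    # features: (T, D) frame features; centroids: (K, D) from offline
    # k-means. Each frame's nearest centroid id is its "pseudo code".
    return torch.cdist(features, centroids).argmin(dim=-1)   # (T,)

def decoder_pretrain_loss(decoder, encoder_states, codes):
    # Teacher-forced next-unit prediction over the pseudo-code sequence,
    # standing in for the text targets an ASR decoder would normally
    # need (the decoder API here is an assumption).
    logits = decoder(encoder_states, codes[:-1])              # (T-1, K)
    return F.cross_entropy(logits, codes[1:])
```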
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to those of previous self-supervised learning work with non-streaming models.
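The random-projection quantizer is simple enough to sketch directly: a frozen random projection and a frozen random codebook turn each frame into a discrete label to be predicted at masked positions. The dimensions below are assumptions; only the mechanism follows the paper:
```python
import torch

torch.manual_seed(0)
dim_in, dim_code, vocab = 80, 16, 8192            # assumed sizes
projection = torch.randn(dim_in, dim_code)        # random, never trained
codebook = torch.randn(vocab, dim_code)           # random, never trained
codebook = codebook / codebook.norm(dim=-1, keepdim=True)

def labels_for(frames):
    # frames: (T, dim_in) speech features -> (T,) discrete labels.
    proj = frames @ projection
    proj = proj / proj.norm(dim=-1, keepdim=True)
    return torch.cdist(proj, codebook).argmin(dim=-1)

# usage: labels = labels_for(torch.randn(100, 80))
```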
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
- Shrinking Bigfoot: Reducing wav2vec 2.0 footprint [4.708858512006221]
Wav2vec 2.0 is a state-of-the-art speech recognition model.
The latency of wav2vec 2.0, however, would be a bottleneck in production.
We explore multiple model compression methods borrowed from the domain of large language models.
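One compression method commonly borrowed from the language-model domain is post-training quantization. As a hedged example (the paper also studies other methods, and the model below is a tiny stand-in for a wav2vec-2.0-like network), PyTorch's dynamic quantization converts linear layers to int8:
```python
import torch

# Stand-in module; a real wav2vec 2.0 checkpoint would be used instead.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 32))

# Dynamically quantize all Linear layers to int8 weights; activations
# are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# The quantized model is smaller and typically faster for CPU inference.
```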
arXiv Detail & Related papers (2021-03-29T16:50:28Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on the 1-second condition and an EER of 3.47% on the full-length condition of the AP17-OLR dataset.
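A plausible recipe for such an extension, sketched under assumptions (mean pooling, a speaker-classification head, cosine scoring) rather than the paper's exact setup:
```python
import torch
import torch.nn.functional as F

class SpeakerHead(torch.nn.Module):
    """Illustrative head on top of frozen or fine-tuned frame features."""

    def __init__(self, dim=768, num_speakers=1211):  # VoxCeleb1 dev size
        super().__init__()
        self.proj = torch.nn.Linear(dim, 256)        # embedding layer
        self.cls = torch.nn.Linear(256, num_speakers)

    def forward(self, frame_reps):                   # (B, T, dim)
        emb = self.proj(frame_reps.mean(dim=1))      # utterance embedding
        return emb, self.cls(emb)                    # train with CE loss

def verify(emb_a, emb_b, threshold=0.5):
    # Score a trial pair by cosine similarity of the two embeddings.
    return F.cosine_similarity(emb_a, emb_b, dim=-1) > threshold
```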
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
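The latent-space masking can be sketched as span sampling; the probability and span length below follow wav2vec 2.0's commonly cited defaults, while the helper itself is illustrative:
```python
import torch

def sample_span_mask(batch, frames, p=0.065, span=10):
    """Sample wav2vec 2.0-style mask spans (illustrative helper)."""
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    num_starts = max(1, int(p * frames))   # ~p of frames start a span
    for b in range(batch):
        for s in torch.randperm(frames - span)[:num_starts]:
            mask[b, int(s):int(s) + span] = True  # spans may overlap
    return mask

# usage: mask = sample_span_mask(4, 500); masked frames receive a
# learned mask embedding, and the contrastive loss is evaluated only
# where mask is True (see the Wav2vec-C sketch above for the loss).
```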
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.