Exploring wav2vec 2.0 on speaker verification and language
identification
- URL: http://arxiv.org/abs/2012.06185v2
- Date: Thu, 14 Jan 2021 14:17:22 GMT
- Title: Exploring wav2vec 2.0 on speaker verification and language
identification
- Authors: Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu
- Abstract summary: Wav2vec 2.0 is a recently proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, an Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% under the 1-second condition and an EER of 3.47% under the full-length condition of the AP17-OLR dataset.
- Score: 9.047596226273495
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Wav2vec 2.0 is a recently proposed self-supervised framework for speech
representation learning. It follows a two-stage training process of
pre-training and fine-tuning, and performs well in speech recognition tasks,
especially in ultra-low-resource cases. In this work, we attempt to extend this
self-supervised framework to speaker verification and language identification.
First, we conduct preliminary experiments indicating that wav2vec 2.0 can
capture information about the speaker and the language. Then we demonstrate the
effectiveness of wav2vec 2.0 on the two tasks respectively. For speaker
verification, we obtain a new state-of-the-art result, an Equal Error Rate
(EER) of 3.61% on the VoxCeleb1 dataset. For language identification, we
obtain an EER of 12.02% under the 1-second condition and an EER of 3.47% under
the full-length condition of the AP17-OLR dataset. Finally, we unify the two
tasks in a single model through multi-task learning.
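Both headline numbers are Equal Error Rates: the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of how EER is commonly computed from verification trial scores, using scikit-learn's ROC utilities (an illustration, not the authors' evaluation code):

```python
# Sketch: computing Equal Error Rate (EER) from verification scores.
# Assumes numpy and scikit-learn; not the authors' evaluation code.
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    """EER: the point where false-accept rate equals false-reject rate."""
    fpr, tpr, _ = roc_curve(labels, scores)   # fpr = false-accept rate
    fnr = 1.0 - tpr                           # fnr = false-reject rate
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold where the rates cross
    return (fpr[idx] + fnr[idx]) / 2.0

# Toy trials: 1 = same speaker (target), 0 = different speaker (non-target).
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.4, 0.3, 0.5, 0.1, 0.7, 0.2])
print(f"EER = {compute_eer(labels, scores):.2%}")
```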
Related papers
- Federated Learning for ASR based on Wav2vec 2.0 [4.711492191554342]
We study the use of federated learning to train an ASR model based on a wav2vec 2.0 model pre-trained by self-supervision.
Experiments show that such a model can obtain, with no use of a language model, a word error rate of 10.92% on the official TED-LIUM 3 test set.
We also analyse the ASR performance for speakers depending on their participation in the federated learning.
arXiv Detail & Related papers (2023-02-20T18:36:46Z)
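The summary above does not name an aggregation scheme, so the sketch below assumes standard federated averaging (FedAvg), where a server combines client model weights in proportion to local dataset size; the toy Linear models stand in for wav2vec 2.0 checkpoints:

```python
# Sketch: FedAvg-style aggregation of client model weights.
# FedAvg is assumed here as the standard baseline; the paper's exact
# federated scheme may differ.
import copy
import torch

def federated_average(client_states, client_sizes):
    """Average client state dicts, weighted by local dataset size."""
    total = sum(client_sizes)
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return avg

# Toy round: two hypothetical clients holding small stand-in models.
clients = [torch.nn.Linear(4, 2) for _ in range(2)]
avg_state = federated_average([c.state_dict() for c in clients], [100, 300])
global_model = torch.nn.Linear(4, 2)
global_model.load_state_dict(avg_state)
```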
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [60.12197397018094]
data2vec is a learning objective that generalizes across several modalities.
We do not encode masked tokens, use a fast convolutional decoder, and amortize the effort to build teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time.
arXiv Detail & Related papers (2022-12-14T22:13:11Z)
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning.
It achieves a 1.7 to 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- Robust Speaker Recognition with Transformers Using wav2vec 2.0 [7.419725234099729]
This paper presents an investigation of using wav2vec 2.0 deep speech representations for the speaker recognition task.
It is concluded that the Contrastive Predictive Coding pre-training scheme efficiently utilizes the power of unlabeled data.
arXiv Detail & Related papers (2022-03-28T20:59:58Z)
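A common way to use wav2vec 2.0 representations for speaker recognition is to pool frame-level features into one utterance embedding and score trials by cosine similarity. A minimal sketch with the Hugging Face transformers API; the checkpoint name and mean pooling are illustrative stand-ins for the trainable back-end such papers typically use:

```python
# Sketch: utterance embeddings from a pre-trained wav2vec 2.0 encoder,
# scored by cosine similarity. Mean pooling is a simple stand-in for a
# trained pooling/back-end; not this paper's exact setup.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16khz):
    """Mean-pool frame-level wav2vec 2.0 features into one vector."""
    inputs = extractor(waveform_16khz, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        frames = model(inputs.input_values).last_hidden_state  # (1, T, 768)
    return frames.mean(dim=1).squeeze(0)                       # (768,)

# Toy trial: random "audio" stands in for enrollment and test utterances.
a, b = torch.randn(16_000).numpy(), torch.randn(16_000).numpy()
score = torch.nn.functional.cosine_similarity(embed(a), embed(b), dim=0)
print(f"trial score = {score.item():.3f}")
```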
- Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset [0.0]
This paper introduces a deep learning-based emotion recognition model for Arabic speech dialogues.
The model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT.
The experimental results surpass previously reported outcomes.
arXiv Detail & Related papers (2021-10-09T00:58:12Z)
- Multi-task Voice-Activated Framework using Self-supervised Learning [0.9864260997723973]
Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data.
We propose a general purpose framework for adapting a pre-trained wav2vec 2.0 model for different voice-activated tasks.
arXiv Detail & Related papers (2021-10-03T19:28:57Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
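The adversarial mapping in wav2vec-U can be pictured as a small GAN: a generator turns speech-segment features into phoneme distributions, and a discriminator tries to tell them apart from real phonemized text. The sketch below is a heavily simplified illustration of that idea; the architectures, dimensions, and loss details are assumptions, not the paper's recipe:

```python
# Sketch: the adversarial idea behind wav2vec-U, heavily simplified.
# All dimensions and architectures here are illustrative assumptions.
import torch
import torch.nn as nn

N_PHONES, FEAT_DIM = 40, 512
generator = nn.Linear(FEAT_DIM, N_PHONES)   # segment features -> phoneme logits
discriminator = nn.Sequential(               # phoneme dists -> real/fake score
    nn.Conv1d(N_PHONES, 64, kernel_size=3, padding=1),
    nn.ReLU(), nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(64, 1),
)

segments = torch.randn(8, 50, FEAT_DIM)      # stand-in pooled speech features
real_text = torch.nn.functional.one_hot(     # stand-in phonemized text
    torch.randint(0, N_PHONES, (8, 50)), N_PHONES).float()

fake = generator(segments).softmax(dim=-1)   # generated phoneme distributions
d_fake = discriminator(fake.transpose(1, 2))        # Conv1d wants (B, C, T)
d_real = discriminator(real_text.transpose(1, 2))

bce = nn.functional.binary_cross_entropy_with_logits
d_loss = (bce(d_real, torch.ones_like(d_real))      # discriminator: real vs fake
          + bce(d_fake.detach(), torch.zeros_like(d_fake)))
g_loss = bce(d_fake, torch.ones_like(d_fake))       # generator tries to fool it
```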
- On Scaling Contrastive Representations for Low-Resource Speech Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
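"Fixed representations" here means the wav2vec 2.0 encoder stays frozen and only a downstream model is trained on its outputs. A minimal sketch of that setting; the linear head and its output vocabulary size are illustrative choices, not the paper's recognizer:

```python
# Sketch: a downstream model on frozen wav2vec 2.0 features, i.e. the
# "fixed representations" setting. The head is an illustrative choice.
import torch
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
for p in encoder.parameters():
    p.requires_grad = False          # representations stay fixed; no fine-tuning

head = torch.nn.Linear(768, 32)      # 32 = hypothetical output vocabulary size
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)

waveform = torch.randn(1, 16_000)    # stand-in for one second of 16 kHz audio
with torch.no_grad():
    feats = encoder(waveform).last_hidden_state   # (1, T, 768), no gradients
logits = head(feats)                 # only the head receives gradient updates
```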
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
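One standard way to obtain such discrete speech representations is to cluster frame-level self-supervised features, for example with k-means, and replace each frame by its cluster id. The sketch below illustrates that idea on placeholder features; the actual system may use a different quantizer, such as a learned vector-quantization codebook:

```python
# Sketch: turning continuous self-supervised features into discrete
# units via k-means. The cluster count and random features are
# placeholders; real systems cluster real SSL frame features.
import numpy as np
from sklearn.cluster import KMeans

frames = np.random.randn(5000, 768).astype(np.float32)  # stand-in SSL features
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(frames)

utterance = np.random.randn(120, 768).astype(np.float32)
units = kmeans.predict(utterance)    # one discrete unit id per frame
print(units[:10])                    # a short discrete-unit sequence
```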
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition [97.44056170380726]
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment using giant Conformer models pre-trained using wav2vec 2.0 pre-training.
We achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
arXiv Detail & Related papers (2020-10-20T17:58:13Z)
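SpecAugment, used in the noisy student recipe above, masks random frequency bands and time spans of the input spectrogram. A minimal sketch of the two basic masking operations; the mask sizes are illustrative defaults, not the paper's exact policy:

```python
# Sketch: SpecAugment-style time and frequency masking on a spectrogram.
# Mask sizes are illustrative defaults, not the paper's exact policy.
import torch

def spec_augment(spec, freq_mask=27, time_mask=100):
    """Zero out one random frequency band and one random time span."""
    spec = spec.clone()
    n_mels, n_frames = spec.shape
    f = torch.randint(0, freq_mask + 1, (1,)).item()
    f0 = torch.randint(0, max(1, n_mels - f), (1,)).item()
    spec[f0:f0 + f, :] = 0.0                     # frequency mask
    t = torch.randint(0, time_mask + 1, (1,)).item()
    t0 = torch.randint(0, max(1, n_frames - t), (1,)).item()
    spec[:, t0:t0 + t] = 0.0                     # time mask
    return spec

mel = torch.randn(80, 400)       # stand-in log-mel spectrogram (mels x frames)
augmented = spec_augment(mel)
```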
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
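Concretely, for each masked time step the model must identify the true quantized latent among a set of distractors by cosine similarity. A minimal sketch of that contrastive objective; the shapes and the temperature value are illustrative:

```python
# Sketch: the wav2vec 2.0 contrastive objective. For each masked step,
# pick the true quantized latent out of K distractors by cosine
# similarity. Shapes and the temperature kappa are illustrative.
import torch
import torch.nn.functional as F

def contrastive_loss(context, true_q, distractor_q, kappa=0.1):
    """context: (T, D) transformer outputs at masked steps;
    true_q: (T, D) quantized targets; distractor_q: (T, K, D) negatives."""
    candidates = torch.cat([true_q.unsqueeze(1), distractor_q], dim=1)  # (T, K+1, D)
    sims = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / kappa
    # The true target sits at index 0 of each candidate set.
    targets = torch.zeros(sims.size(0), dtype=torch.long)
    return F.cross_entropy(sims, targets)

T, K, D = 32, 100, 256
loss = contrastive_loss(torch.randn(T, D), torch.randn(T, D), torch.randn(T, K, D))
```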