data2vec-aqc: Search for the right Teaching Assistant in the
Teacher-Student training setup
- URL: http://arxiv.org/abs/2211.01246v2
- Date: Sat, 13 May 2023 21:16:36 GMT
- Title: data2vec-aqc: Search for the right Teaching Assistant in the
Teacher-Student training setup
- Authors: Vasista Sai Lodagala and Sreyan Ghosh and S. Umesh
- Abstract summary: We propose a new Self-Supervised Learning (SSL) algorithm called data2vec-aqc.
Our goal is to improve SSL for speech in domains where both unlabeled and labeled data are limited.
- Score: 1.2031796234206138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new Self-Supervised Learning (SSL) algorithm
called data2vec-aqc, for speech representation learning from unlabeled speech
data. Our goal is to improve SSL for speech in domains where both unlabeled and
labeled data are limited. Building on the recently introduced data2vec, we
introduce additional modules to the data2vec framework that leverage the
benefit of data augmentations, quantized representations, and clustering. The
interaction between these modules helps solve the cross-contrastive loss as an
additional self-supervised objective. data2vec-aqc achieves up to 14.1% and
20.9% relative WER improvement over the existing state-of-the-art data2vec
system on the test-clean and test-other sets of LibriSpeech, respectively,
without the use of any language model (LM). Our proposed model also achieves up
to 17.8% relative WER gains over the baseline data2vec when fine-tuned on a
subset of the Switchboard dataset. Code:
https://github.com/Speech-Lab-IITM/data2vec-aqc.
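The abstract does not spell out the loss, but the cross-contrastive idea (two augmented views, each student output matched against the other view's teacher-derived targets) can be illustrated with a short PyTorch sketch. The function names, shapes, and InfoNCE-style form below are assumptions for illustration, not the released implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def contrastive(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull each anchor frame toward its positive,
    push it away from the sampled negatives (wav2vec 2.0-style).
    anchor, positive: (T, D); negatives: (K, T, D)."""
    pos = F.cosine_similarity(anchor, positive, dim=-1)                # (T,)
    neg = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)  # (K, T)
    logits = torch.cat([pos.unsqueeze(0), neg], dim=0) / temperature   # (1+K, T)
    labels = torch.zeros(logits.size(1), dtype=torch.long,
                         device=logits.device)                         # positive = index 0
    return F.cross_entropy(logits.t(), labels)

def cross_contrastive(student_a, student_b, targets_a, targets_b,
                      negatives_a, negatives_b):
    """Cross-contrastive objective (illustrative): the student output for
    view A is contrasted against view B's targets, and vice versa."""
    loss_ab = contrastive(student_a, targets_b, negatives_b)
    loss_ba = contrastive(student_b, targets_a, negatives_a)
    return 0.5 * (loss_ab + loss_ba)
```

In this reading, `targets_a` and `targets_b` would come from the EMA teacher (optionally quantized and clustered, per the abstract) applied to each augmented view.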
Related papers
- MaskMatch: Boosting Semi-Supervised Learning Through Mask Autoencoder-Driven Feature Learning [8.255082589733673]
MaskMatch is a novel algorithm that fully utilizes unlabeled data to boost semi-supervised learning.
MaskMatch integrates a self-supervised learning strategy, i.e., a Masked Autoencoder (MAE), that uses all available data to enforce visual representation learning.
MaskMatch achieves low error rates of 18.71%, 9.47%, and 3.07% on three challenging datasets.
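A hedged sketch of how such an MAE term might sit alongside a standard semi-supervised objective (confidence-thresholded pseudo-labeling); `model.classify` and `model.mae_reconstruction_loss` are hypothetical interfaces, not MaskMatch's actual API:

```python
import torch
import torch.nn.functional as F

def maskmatch_style_step(model, labeled_x, labels, unlabeled_x,
                         threshold=0.95, mae_weight=1.0):
    """One illustrative training step: supervised cross-entropy on labeled
    data, confidence-thresholded pseudo-labeling on unlabeled data, and an
    MAE reconstruction term over all available data."""
    sup_loss = F.cross_entropy(model.classify(labeled_x), labels)

    # Pseudo-label only the confident unlabeled examples
    with torch.no_grad():
        probs = model.classify(unlabeled_x).softmax(dim=-1)
        conf, pseudo = probs.max(dim=-1)
        keep = conf >= threshold
    if keep.any():
        unsup_loss = F.cross_entropy(model.classify(unlabeled_x)[keep], pseudo[keep])
    else:
        unsup_loss = torch.zeros((), device=labeled_x.device)

    # Masked-autoencoder reconstruction uses all data, labeled or not
    all_x = torch.cat([labeled_x, unlabeled_x], dim=0)
    mae_loss = model.mae_reconstruction_loss(all_x)  # hypothetical helper

    return sup_loss + unsup_loss + mae_weight * mae_loss
```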
arXiv Detail & Related papers (2024-05-10T03:39:54Z)
- Mispronunciation detection using self-supervised speech representations [10.010024759851142]
We study the use of SSL models for the task of mispronunciation detection for second language learners.
We compare two downstream approaches: 1) training the model for phone recognition using native English data, and 2) training a model directly for the target task using non-native English data.
arXiv Detail & Related papers (2023-07-30T21:20:58Z)
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [60.12197397018094]
data2vec is a learning objective that generalizes across several modalities.
We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time.
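A minimal sketch of the amortization idea, assuming a `student`/`teacher` pair and a `make_mask` helper (all hypothetical names): the teacher encodes the clean input once, and its contextualized outputs are reused as regression targets for several differently-masked student passes.

```python
import torch
import torch.nn.functional as F

def amortized_teacher_step(student, teacher, x, make_mask, num_masks=4):
    """The (EMA) teacher encodes the unmasked input once; its contextualized
    representations serve as regression targets for several differently
    masked student passes, amortizing the teacher's cost."""
    with torch.no_grad():
        targets = teacher(x)                  # (B, T, D), computed once

    loss = 0.0
    for _ in range(num_masks):
        mask = make_mask(x)                   # (B, T) boolean, True = masked
        pred = student(x, mask)               # student skips masked tokens;
                                              # a small conv decoder fills them in
        loss = loss + F.smooth_l1_loss(pred[mask], targets[mask])
    return loss / num_masks
```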
arXiv Detail & Related papers (2022-12-14T22:13:11Z)
- More Speaking or More Speakers? [17.143456510764576]
Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR).
In this work we aim to analyse the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL).
Our findings suggest that SSL requires a large amount of unlabeled data to produce high accuracy results, while ST requires a sufficient number of speakers in the labelled data, especially in the low-resource setting.
arXiv Detail & Related papers (2022-11-02T03:50:40Z)
- CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-supervised learning of speech representations [1.2031796234206138]
We present a new pre-training strategy named ccc-wav2vec 2.0, which uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective.
ccc-wav2vec 2.0 achieves up to 15.6% and 12.7% relative WER improvement over the baseline wav2vec 2.0 on the test-clean and test-other sets of LibriSpeech, respectively, without the use of any language model.
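The clustering-aided part can be illustrated by a cluster-aware negative sampler: negatives are drawn only from frames in a different cluster than the anchor, so frames acoustically similar to the positive are not pushed away. A sketch under assumed shapes (cluster ids could come, e.g., from k-means over the quantized codes):

```python
import torch

def cluster_aware_negatives(features, cluster_ids, num_negatives=100):
    """For each time step, sample negatives only from frames whose cluster id
    differs from the anchor's, so near-duplicates of the positive are not
    treated as negatives. features: (T, D); cluster_ids: (T,)."""
    T = features.size(0)
    negatives = []
    for t in range(T):
        candidates = (cluster_ids != cluster_ids[t]).nonzero(as_tuple=True)[0]
        if candidates.numel() == 0:           # degenerate case: one big cluster
            candidates = torch.arange(T, device=features.device)
        idx = candidates[torch.randint(0, candidates.numel(), (num_negatives,),
                                       device=features.device)]
        negatives.append(features[idx])
    return torch.stack(negatives)             # (T, num_negatives, D)
```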
arXiv Detail & Related papers (2022-10-05T22:44:35Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
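One plausible reading of "multiple hypotheses" is a loss that weights the N-best decoder outputs by their scores instead of committing to a single pseudo-label; the sketch below assumes a `log_probs_fn(x, y)` helper returning the sequence log-likelihood, and is illustrative only:

```python
import torch

def multi_hypothesis_loss(log_probs_fn, x, nbest_hyps, hyp_scores):
    """Weight the negative log-likelihood of each N-best hypothesis by its
    normalized decoder score, rather than training on a single pseudo-label.
    log_probs_fn(x, y) -> sequence log-likelihood of hypothesis y given x."""
    weights = torch.softmax(torch.tensor(hyp_scores, dtype=torch.float), dim=0)
    loss = 0.0
    for w, hyp in zip(weights, nbest_hyps):
        loss = loss - w * log_probs_fn(x, hyp)   # score-weighted NLL
    return loss
```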
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [49.47516627019855]
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
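A hedged sketch of how the two objectives might be combined in one step: a wav2vec 2.0-style contrastive term (positive at index 0) plus a masked-prediction cross-entropy over discrete code ids. Shapes and the weighting are assumptions, not w2v-BERT's actual code:

```python
import torch
import torch.nn.functional as F

def w2v_bert_style_loss(contrastive_logits, mlm_logits, code_ids, mask,
                        mlm_weight=1.0):
    """Contrastive term (positive at index 0, as in wav2vec 2.0) plus a
    masked-prediction cross-entropy classifying each masked frame into its
    discrete code id. Shapes: contrastive_logits (T, 1+K); mlm_logits (T, V);
    code_ids (T,); mask (T,) boolean."""
    labels = torch.zeros(contrastive_logits.size(0), dtype=torch.long,
                         device=contrastive_logits.device)
    contrastive_loss = F.cross_entropy(contrastive_logits, labels)
    mlm_loss = F.cross_entropy(mlm_logits[mask], code_ids[mask])
    return contrastive_loss + mlm_weight * mlm_loss
```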
arXiv Detail & Related papers (2021-08-07T06:29:36Z)
- Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations [49.55361944105796]
We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence framework.
A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker.
arXiv Detail & Related papers (2020-10-23T08:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.