Representative Subset Selection for Efficient Fine-Tuning in
Self-Supervised Speech Recognition
- URL: http://arxiv.org/abs/2203.09829v3
- Date: Tue, 11 Apr 2023 18:13:48 GMT
- Title: Representative Subset Selection for Efficient Fine-Tuning in
Self-Supervised Speech Recognition
- Authors: Abdul Hameed Azeemi, Ihsan Ayyub Qazi, Agha Ali Raza
- Abstract summary: We consider the task of identifying an optimal subset of data for efficient fine-tuning in self-supervised speech models for ASR.
We present the COWERAGE algorithm for representative subset selection in self-supervised ASR.
- Score: 6.450618373898492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised speech recognition models require considerable labeled
training data for learning high-fidelity representations for Automatic Speech
Recognition (ASR), which is computationally demanding and time-consuming. We
consider the task of identifying an optimal subset of data for efficient
fine-tuning in self-supervised speech models for ASR. We discover that the
dataset pruning strategies used in vision tasks for sampling the most
informative examples do not perform better than random subset selection on
fine-tuning self-supervised ASR. We then present the COWERAGE algorithm for
representative subset selection in self-supervised ASR. COWERAGE is based on
our finding that ensuring the coverage of examples based on training Word Error
Rate (WER) in the early training epochs leads to better generalization
performance. Extensive experiments with the wav2vec 2.0 and HuBERT models on
TIMIT, Librispeech, and LJSpeech datasets show the effectiveness of COWERAGE
and its transferability across models, with up to 17% relative WER improvement
over existing dataset pruning methods and random sampling. We also demonstrate
that the coverage of training instances in terms of WER values ensures the
inclusion of phonemically diverse examples, leading to better test accuracy in
self-supervised speech recognition models.
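The abstract describes COWERAGE only at a high level: record each training utterance's Word Error Rate (WER) in an early fine-tuning epoch, then choose a subset that covers the full range of those WER values. The sketch below is a minimal illustration of that coverage idea under stated assumptions, not the authors' released implementation; the function name coverage_subset, the equal-width binning, and the default bin count are illustrative choices.

```python
import random
from typing import Dict, List

def coverage_subset(wer_by_utt: Dict[str, float], budget: int,
                    n_bins: int = 10, seed: int = 0) -> List[str]:
    """Pick a fine-tuning subset whose early-epoch training WERs span the full WER range.

    wer_by_utt : utterance id -> training WER recorded in an early epoch (assumed given).
    budget     : number of utterances to keep for fine-tuning.
    n_bins     : number of equal-width WER buckets (illustrative default).
    """
    rng = random.Random(seed)
    lo, hi = min(wer_by_utt.values()), max(wer_by_utt.values())
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width when all WERs are equal

    # Bucket utterances by WER so easy, medium, and hard examples are all represented.
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    for utt, wer in wer_by_utt.items():
        idx = min(int((wer - lo) / width), n_bins - 1)
        bins[idx].append(utt)

    # Draw roughly the same number of utterances from every non-empty bucket.
    non_empty = sum(1 for b in bins if b)
    per_bin = max(budget // max(non_empty, 1), 1)
    selected: List[str] = []
    for b in bins:
        rng.shuffle(b)
        selected.extend(b[:per_bin])

    # Top up with random leftovers if integer division left the budget unfilled.
    if len(selected) < budget:
        leftovers = [u for b in bins for u in b[per_bin:]]
        rng.shuffle(leftovers)
        selected.extend(leftovers[: budget - len(selected)])
    return selected[:budget]

# Example: keep 2 of 5 utterances while still covering low, mid, and high WER regions.
wers = {"utt1": 0.05, "utt2": 0.10, "utt3": 0.40, "utt4": 0.75, "utt5": 0.80}
subset = coverage_subset(wers, budget=2, n_bins=4)
```

Sampling evenly across WER buckets keeps easy, medium, and hard utterances in the fine-tuning set, which is the property the abstract links to phonemic diversity and better test accuracy.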
Related papers
- Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs [73.74375912785689]
This paper proposes unified training strategies for speech recognition systems.
We demonstrate that training a single model for all three tasks enhances VSR and AVSR performance.
We also introduce a greedy pseudo-labelling approach to more effectively leverage unlabelled samples.
arXiv Detail & Related papers (2024-11-04T16:46:53Z)
- Acoustic Model Optimization over Multiple Data Sources: Merging and Valuation [13.009945735929445]
We propose a novel paradigm to solve salient problems plaguing the Automatic Speech Recognition field.
In the first stage, multiple acoustic models are trained based upon different subsets of the complete speech data.
In the second stage, two novel algorithms are utilized to generate a high-quality acoustic model.
arXiv Detail & Related papers (2024-10-21T03:48:23Z)
- Efficient data selection employing Semantic Similarity-based Graph Structures for model training [1.5845679507219355]
This paper introduces Semantics for data SAliency in Model performance Estimation (SeSaME)
It is an efficient data sampling mechanism solely based on textual information without passing the data through a compute-heavy model.
The application of this approach is demonstrated in the use case of low-resource automated speech recognition (ASR) models.
arXiv Detail & Related papers (2024-02-22T09:43:53Z)
- Learning towards Selective Data Augmentation for Dialogue Generation [52.540330534137794]
We argue that not all cases are beneficial for the augmentation task, and that the cases suitable for augmentation should satisfy two attributes.
We propose a Selective Data Augmentation framework (SDA) for the response generation task.
arXiv Detail & Related papers (2023-03-17T01:26:39Z)
- Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models [13.956691231452336]
Self-supervised learning (SSL) has been able to leverage unlabeled data to boost the performance of automatic speech recognition (ASR) models.
Our work investigates different unsupervised data selection techniques for fine-tuning the HuBERT model under a limited transcription budget.
arXiv Detail & Related papers (2022-12-03T18:05:08Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pairs and SynthesizedAudio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)