Stable Distillation: Regularizing Continued Pre-training for
Low-Resource Automatic Speech Recognition
- URL: http://arxiv.org/abs/2312.12783v1
- Date: Wed, 20 Dec 2023 06:02:12 GMT
- Title: Stable Distillation: Regularizing Continued Pre-training for
Low-Resource Automatic Speech Recognition
- Authors: Ashish Seth and Sreyan Ghosh and S. Umesh and Dinesh Manocha
- Abstract summary: Stable Distillation is a simple and novel approach for SSL-based continued pre-training.
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
- Score: 54.9235160379917
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Continued self-supervised (SSL) pre-training for adapting existing SSL models
to the target domain has been shown to be extremely effective for low-resource
Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a
simple and novel approach for SSL-based continued pre-training that boosts ASR
performance in the target domain where both labeled and unlabeled data are
limited. Stable Distillation employs self-distillation as regularization for
continued pre-training, alleviating the over-fitting issue, a common problem
continued pre-training faces when the source and target domains differ.
Specifically, we first perform vanilla continued pre-training of an initial
SSL pre-trained model on the target-domain ASR dataset and call it the teacher.
Next, we take the same initial pre-trained model as a student and perform
continued pre-training while enforcing its hidden representations to stay close
to those of the teacher (via an MSE loss). This student is then used for downstream
ASR fine-tuning on the target dataset. In practice, Stable Distillation
outperforms all our baselines by 0.8-7 WER when evaluated in various
experimental settings.
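The two-stage recipe described in the abstract can be read as pseudocode. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: TinyEncoder, ssl_loss, and the weight alpha are placeholder names standing in for the actual pre-trained SSL encoder (e.g., a wav2vec 2.0-style model), its real pre-training objective, and the paper's loss weighting, none of which are specified in this listing.
```python
# Sketch of Stable Distillation: teacher = vanilla continued pre-training,
# student = continued pre-training + MSE regularization toward the teacher.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained SSL speech encoder (hypothetical)."""
    def __init__(self, in_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, hidden))

    def forward(self, x):           # x: (batch, time, in_dim)
        return self.net(x)          # hidden representations: (batch, time, hidden)

def ssl_loss(hidden):
    # Placeholder for the model's own SSL pre-training objective
    # (contrastive / masked prediction in the real setting).
    return hidden.pow(2).mean()

# Step 1: vanilla continued pre-training of the initial SSL model -> teacher.
initial = TinyEncoder()
teacher = copy.deepcopy(initial)
opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)
target_batch = torch.randn(4, 100, 80)      # unlabeled target-domain features
for _ in range(10):
    opt_t.zero_grad()
    ssl_loss(teacher(target_batch)).backward()
    opt_t.step()

# Step 2: continue pre-training the same initial model as the student,
# regularized so its hidden representations stay close to the teacher's (MSE).
student = copy.deepcopy(initial)
teacher.eval()
opt_s = torch.optim.Adam(student.parameters(), lr=1e-4)
alpha = 1.0                                 # assumed distillation weight
for _ in range(10):
    opt_s.zero_grad()
    h_s = student(target_batch)
    with torch.no_grad():
        h_t = teacher(target_batch)
    loss = ssl_loss(h_s) + alpha * F.mse_loss(h_s, h_t)
    loss.backward()
    opt_s.step()

# The student (not the teacher) then goes on to supervised ASR fine-tuning
# on the labeled target dataset.
```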
Related papers
- FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous
Self-Supervised Learning [54.9235160379917]
FusDom is a simple and novel methodology for SSL-based continued pre-training.
FusDom learns speech representations that are robust and adaptive yet not forgetful of concepts seen in the past.
arXiv Detail & Related papers (2023-12-20T13:50:05Z)
- Progressive Feature Adjustment for Semi-supervised Learning from
Pretrained Models [39.42802115580677]
Semi-supervised learning (SSL) can leverage both labeled and unlabeled data to build a predictive model.
Recent literature suggests that naively applying state-of-the-art SSL with a pretrained model fails to unleash the full potential of training data.
We propose to use pseudo-labels from the unlabeled data to update a feature extractor that is less sensitive to incorrect labels.
arXiv Detail & Related papers (2023-09-09T01:57:14Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degenerate the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- Analyzing the factors affecting usefulness of Self-Supervised
Pre-trained Representations for Speech Recognition [1.0705399532413615]
Self-supervised learning (SSL) to learn high-level speech representations has been a popular approach to building Automatic Speech Recognition systems.
We study the effect of domain, language, dataset size, and other aspects of our upstream pre-training SSL data on the final performance of the low-resource downstream ASR task.
arXiv Detail & Related papers (2022-03-31T11:48:24Z)
- Two-phase Pseudo Label Densification for Self-training based Domain
Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding window voting to propagate the confident predictions, utilizing intrinsic spatial-correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
arXiv Detail & Related papers (2020-12-09T02:35:25Z)
- Unsupervised Domain Adaptation for Speech Recognition via Uncertainty
Driven Self-Training [55.824641135682725]
Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD as target domains show that up to 80% of the performance of a system trained on ground-truth data can be recovered.
arXiv Detail & Related papers (2020-11-26T18:51:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.