Improving Low-Resource Speech Recognition with Pretrained Speech Models:
Continued Pretraining vs. Semi-Supervised Training
- URL: http://arxiv.org/abs/2207.00659v1
- Date: Fri, 1 Jul 2022 21:02:51 GMT
- Title: Improving Low-Resource Speech Recognition with Pretrained Speech Models:
Continued Pretraining vs. Semi-Supervised Training
- Authors: Mitchell DeHaven, Jayadev Billa
- Abstract summary: Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR).
We show that continued pretraining (CoPT) yields word error rates (WERs) equal to or slightly better than those from semi-supervised training (SST).
In addition, we show that using the CoPT model for pseudo-labeling, and using these labels in SST, results in further improvements in WER.
- Score: 6.523198497365586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT,
have produced significant improvements over existing approaches to automatic
speech recognition (ASR). This is evident in the performance of the wav2vec 2.0
based pretrained XLSR-53 model across many languages when fine-tuned with
available labeled data. However, the performance obtained by fine-tuning these models
can depend on the amount of in-language or similar-to-in-language data
included in the pretraining dataset. In this paper we investigate continued
pretraining (CoPT) with unlabeled in-language audio data on the XLSR-53
pretrained model in several low-resource languages. CoPT is more
computationally efficient than semi-supervised training (SST), the standard
approach of utilizing unlabeled data in ASR, since it omits the need for
pseudo-labeling of the unlabeled data. We show that CoPT results in word error rates
(WERs) equal to or slightly better than those obtained with SST. In addition, we show that
using the CoPT model for pseudo-labeling, and using these labels in SST,
results in further improvements in WER.
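The paper itself does not include code here. As a rough illustration of what CoPT looks like in practice, the sketch below resumes the wav2vec 2.0 self-supervised objective from the XLSR-53 checkpoint on unlabeled in-language audio, using the Hugging Face port of the model. The checkpoint name, masking hyperparameters, and learning rate are assumptions for illustration, not values from the paper (the abstract does not specify the toolchain, and XLSR-53 pretraining is often run in fairseq instead).

```python
# Minimal CoPT sketch, assuming the Hugging Face port of XLSR-53 and
# fixed-length 16 kHz audio crops; hyperparameters are placeholders,
# not the paper's settings.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

CKPT = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CKPT)
# If the converted checkpoint lacks quantizer weights, they are re-initialized;
# this would need to be verified before relying on the sketch.
model = Wav2Vec2ForPreTraining.from_pretrained(CKPT).train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # placeholder LR

def copt_step(crops):
    """One continued-pretraining update on a list of equal-length waveforms:
    mask feature-encoder frames, sample distractors, and minimize the
    wav2vec 2.0 contrastive + diversity loss."""
    inputs = extractor(crops, sampling_rate=16_000, return_tensors="pt")
    batch_size, raw_len = inputs.input_values.shape
    feat_len = int(model._get_feat_extract_output_lengths(raw_len))

    # Span masking and negative sampling mirror wav2vec 2.0 pretraining.
    mask = _compute_mask_indices((batch_size, feat_len),
                                 mask_prob=0.065, mask_length=10)
    negatives = _sample_negative_indices((batch_size, feat_len),
                                         num_negatives=100,
                                         mask_time_indices=mask)

    out = model(inputs.input_values,
                mask_time_indices=torch.tensor(mask, dtype=torch.bool),
                sampled_negative_indices=torch.tensor(negatives, dtype=torch.long))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```

After CoPT, the model is fine-tuned with the labeled data as usual; the further gain reported in the paper comes from then decoding the unlabeled audio with that fine-tuned model and feeding the resulting pseudo-labels back into SST.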
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Self-supervised Adaptive Pre-training of Multilingual Speech Models for Language and Dialect Identification [19.893213508284813]
Self-supervised adaptive pre-training is proposed to adapt the pre-trained model to the target domain and languages of the downstream task.
We show that SAPT improves XLSR performance on the FLEURS benchmark, with substantial gains of up to 40.1% for under-represented languages.
arXiv Detail & Related papers (2023-12-12T14:58:08Z)
- Simple and Effective Unsupervised Speech Translation [68.25022245914363]
We study a simple and effective approach to build speech translation systems without labeled data.
We present an unsupervised domain adaptation technique for pre-trained speech models.
Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art.
arXiv Detail & Related papers (2022-10-18T22:26:13Z)
- ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks [8.651248939672769]
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2022: low-resource and dialect speech translation.
We build an end-to-end model as our joint primary submission, and compare it against cascaded models that leverage a large fine-tuned wav2vec 2.0 model for ASR.
Our results highlight that self-supervised models trained on smaller sets of target data are more effective for low-resource end-to-end ST fine-tuning than large off-the-shelf models.
arXiv Detail & Related papers (2022-05-04T10:36:57Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech/pseudo-label pairs and synthesized-audio/text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Improving Neural Machine Translation by Denoising Training [95.96569884410137]
We present a simple and effective pretraining strategy, Denoising Training (DoT), for neural machine translation.
We update the model parameters with source- and target-side denoising tasks at the early stage and then tune the model normally.
Experiments show DoT consistently improves the neural machine translation performance across 12 bilingual and 16 multilingual directions.
arXiv Detail & Related papers (2022-01-19T00:11:38Z)
- Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT significantly improves state-of-the-art neural machine translation performance across 15 translation tasks on 8 language pairs.
arXiv Detail & Related papers (2021-09-16T07:58:33Z)
- Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)