Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings
- URL: http://arxiv.org/abs/2405.13018v1
- Date: Wed, 15 May 2024 06:59:33 GMT
- Title: Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings
- Authors: Ahmed Adel Attia, Dorottya Demszky, Tolulope Ogunremi, Jing Liu, Carol Espy-Wilson
- Abstract summary: We study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain.
We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%.
- Score: 4.266613351203219
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools that aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain. We show that CPT is a powerful tool in this regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noises, microphones, and classroom conditions, as well as to classroom demographics. Our CPT models show an improved ability to generalize to demographics unseen in the labeled fine-tuning data.
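As a concrete illustration of the technique, the following is a minimal sketch of one CPT step using the Hugging Face transformers implementation of wav2vec 2.0's self-supervised objective (contrastive plus codebook-diversity loss). The checkpoint name, masking hyperparameters, and placeholder audio are assumptions for the sketch, not the authors' exact setup; in practice the batch would come from unlabeled classroom recordings.

```python
# Minimal sketch of one continued-pretraining (CPT) step for wav2vec 2.0.
# Assumptions: Hugging Face transformers, the facebook/wav2vec2-base
# checkpoint, and random placeholder audio standing in for unlabeled
# 16 kHz classroom recordings.
import torch
from transformers import Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

model = Wav2Vec2ForPreTraining.from_pretrained("facebook/wav2vec2-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# one batch of unlabeled audio: (batch, samples) at 16 kHz
input_values = torch.randn(2, 16000 * 5)

# length of the latent sequence produced by the CNN feature encoder
seq_len = int(model._get_feat_extract_output_lengths(input_values.shape[1]))

# sample masked time steps and negatives for the contrastive objective,
# mirroring the original wav2vec 2.0 pretraining recipe
mask_time_indices = _compute_mask_indices(
    shape=(input_values.shape[0], seq_len), mask_prob=0.65, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(input_values.shape[0], seq_len),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices),
)
outputs.loss.backward()  # contrastive + codebook-diversity loss
optimizer.step()
```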
Related papers
- CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments [4.266613351203219]
We study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain.
We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%.
arXiv Detail & Related papers (2024-09-13T19:14:18Z)
- Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels [100.43280310123784]
We investigate the use of automatically generated transcriptions of unlabelled datasets to increase the training set size.
We demonstrate that increasing the size of the training set, a recent trend in the literature, leads to reduced WER despite using noisy transcriptions.
The proposed model achieves new state-of-the-art performance on AV-ASR on LRS2 and LRS3.
arXiv Detail & Related papers (2023-03-25T00:37:34Z)
- Improving Low-Resource Speech Recognition with Pretrained Speech Models: Continued Pretraining vs. Semi-Supervised Training [6.523198497365586]
Self-supervised Transformer-based models, such as wav2vec 2.0 and HuBERT, have produced significant improvements over existing approaches to automatic speech recognition (ASR).
We show that continued pretraining (CoPT) results in word error rates (WERs) equal to or slightly better than those from semi-supervised training (SST).
In addition, we show that using the CoPT model for pseudo-labeling, and using these labels in SST, results in further improvements in WER.
arXiv Detail & Related papers (2022-07-01T21:02:51Z)
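The pseudo-labeling step described in the entry above is simple to sketch: a fine-tuned CTC model transcribes unlabeled audio, and the transcripts become targets for the SST stage. Below is a minimal, hedged illustration with Hugging Face transformers and torchaudio; the checkpoint name and greedy decoding are assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: pseudo-labeling unlabeled audio with a pretrained-then-
# fine-tuned CTC model, so the transcripts can serve as targets for
# semi-supervised training (SST). Checkpoint and file list are illustrative.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").eval()

def pseudo_label(wav_path: str) -> str:
    """Transcribe one unlabeled utterance; the output becomes an SST target."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:  # wav2vec 2.0 expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)  # greedy CTC decoding
    return processor.batch_decode(ids)[0]

# the pseudo-labeled pairs would then be mixed into the supervised
# fine-tuning set for the SST stage, e.g.:
# pairs = [(p, pseudo_label(p)) for p in unlabeled_wav_paths]
```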
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose stochastic compression for on-demand compute reduction in wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of the STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, we developed our best solution with the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [78.92428622630861]
We show that wav2vec 2.0 can be used for speech emotion recognition (SER).
Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented.
We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset.
We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations.
arXiv Detail & Related papers (2021-10-12T19:55:55Z)
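Of the methods in the preceding entry, vanilla fine-tuning (V-FT) is the easiest to illustrate: attach a classification head to wav2vec 2.0 and train end-to-end on emotion labels. The sketch below uses Hugging Face transformers; the checkpoint, the four-way label set, and the placeholder batch are assumptions, not the paper's exact configuration.

```python
# Hedged sketch of "vanilla fine-tuning" (V-FT): a classification head on
# top of wav2vec 2.0, trained end-to-end on emotion labels. The label set
# and batch below are placeholders, not the paper's exact setup.
import torch
from transformers import Wav2Vec2ForSequenceClassification

labels = ["neutral", "happy", "sad", "angry"]  # assumed 4-way emotion setup
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=len(labels)
)
model.freeze_feature_encoder()  # common practice: keep the CNN front-end fixed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# placeholder batch: 2 utterances of 3 s of 16 kHz audio with emotion targets
input_values = torch.randn(2, 16000 * 3)
targets = torch.tensor([0, 3])

model.train()
loss = model(input_values, labels=targets).loss  # cross-entropy over emotions
loss.backward()
optimizer.step()
```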
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- On Scaling Contrastive Representations for Low-Resource Speech Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
- Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition [97.44056170380726]
We employ a combination of recent developments in semi-supervised learning for automatic speech recognition to obtain state-of-the-art results on LibriSpeech.
We carry out noisy student training with SpecAugment, using giant Conformer models pre-trained with wav2vec 2.0.
We achieve word error rates (WERs) of 1.4%/2.6% on the LibriSpeech test/test-other sets, against the current state-of-the-art WERs of 1.7%/3.3%.
arXiv Detail & Related papers (2020-10-20T17:58:13Z)
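SpecAugment, the augmentation at the heart of the noisy student recipe in the last entry, is straightforward to reproduce. The following is a minimal sketch using torchaudio; the mel-spectrogram settings and mask sizes are illustrative assumptions rather than the paper's exact augmentation policy.

```python
# Hedged sketch of SpecAugment-style masking as used in noisy student
# training: mask random frequency bands and time spans of a spectrogram so
# the student must be robust to corrupted views. Parameters are illustrative.
import torch
import torchaudio.transforms as T

spec = T.MelSpectrogram(sample_rate=16000, n_mels=80)
freq_mask = T.FrequencyMasking(freq_mask_param=27)  # mask up to 27 mel bins
time_mask = T.TimeMasking(time_mask_param=100)      # mask up to 100 frames

waveform = torch.randn(1, 16000 * 5)  # placeholder 5 s utterance
mel = spec(waveform)
# apply two frequency masks and two time masks, as in typical recipes
augmented = time_mask(time_mask(freq_mask(freq_mask(mel))))
```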