Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- URL: http://arxiv.org/abs/2110.04484v1
- Date: Sat, 9 Oct 2021 07:09:22 GMT
- Title: Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- Authors: Han Zhu, Li Wang, Ying Hou, Jindong Wang, Gaofeng Cheng, Pengyuan
Zhang, Yonghong Yan
- Abstract summary: Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR).
Most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks.
We propose a novel pre-training paradigm called wav2vec-S, which uses task-specific semi-supervised pre-training to bridge the gap between task-agnostic pre-training and task-specific fine-tuning.
- Score: 44.347739529374124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training has dramatically improved the performance of
automatic speech recognition (ASR). However, most existing self-supervised
pre-training approaches are task-agnostic, i.e., they can be applied to various
downstream tasks. This leaves a gap between the task-agnostic pre-training and
the task-specific downstream fine-tuning, which may degrade downstream
performance. In this work, we propose a novel pre-training paradigm called
wav2vec-S, where we use task-specific semi-supervised pre-training to bridge
this gap. Specifically, the semi-supervised pre-training is conducted on the
basis of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR
show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase
in pre-training time but significantly improves ASR performance on
in-domain, cross-domain and cross-lingual datasets. The average relative WER
reductions are 26.3% and 6.3% for 1h and 10h fine-tuning, respectively.
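The abstract describes continuing pre-training from a self-supervised checkpoint while adding a task-specific supervised signal on labeled data. A minimal sketch of that idea is a weighted combination of the two losses; the function name, the specific losses (contrastive SSL, CTC), and the weighting scheme below are illustrative assumptions, not the paper's exact formulation.

```python
def semi_supervised_loss(ssl_loss: float, supervised_loss: float, alpha: float = 0.5) -> float:
    """Hypothetical semi-supervised pre-training objective.

    Mixes the self-supervised loss (e.g. the wav2vec 2.0 contrastive loss
    on unlabeled audio) with a supervised ASR loss (e.g. CTC on the small
    labeled set). alpha controls how much weight the supervised task gets.
    """
    return alpha * supervised_loss + (1.0 - alpha) * ssl_loss


# Example: equal weighting of the two objectives for one training step.
total = semi_supervised_loss(ssl_loss=2.0, supervised_loss=4.0, alpha=0.5)
print(total)
```

In practice, each training batch would draw unlabeled audio for the self-supervised term and labeled audio for the supervised term, so the model stays close to its self-supervised initialization while adapting toward the downstream ASR task.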
Related papers
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR)
We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two.
arXiv Detail & Related papers (2024-05-07T13:11:37Z) - Efficient Stagewise Pretraining via Progressive Subnetworks [55.65819977062729]
We propose an alternative framework, progressive subnetwork training, that maintains the full model throughout training but trains only subnetworks within the model at each step.
RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs compared to standard training, and is competitive or better than other efficient training methods.
arXiv Detail & Related papers (2024-02-08T18:49:09Z) - Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves during proficient training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z) - Stable Distillation: Regularizing Continued Pre-training for
Low-Resource Automatic Speech Recognition [54.9235160379917]
Stable Distillation is a simple and novel approach for SSL-based continued pre-training.
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
arXiv Detail & Related papers (2023-12-20T06:02:12Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose stochastic compression for on-demand compute reduction for wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z) - How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An
Extensive Benchmark on Air Traffic Control Communications [1.3800173438685746]
We study the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases.
We benchmark the proposed models on four challenging ATC test sets.
We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
arXiv Detail & Related papers (2022-03-31T06:10:42Z) - Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z) - On Scaling Contrastive Representations for Low-Resource Speech
Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z) - Joint Masked CPC and CTC Training for ASR [29.41599824919278]
We demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data.
We show that this joint training method directly optimizes performance for the downstream ASR task while using unsupervised data.
arXiv Detail & Related papers (2020-10-30T20:28:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.