Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- URL: http://arxiv.org/abs/2110.04484v1
- Date: Sat, 9 Oct 2021 07:09:22 GMT
- Title: Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- Authors: Han Zhu, Li Wang, Ying Hou, Jindong Wang, Gaofeng Cheng, Pengyuan
Zhang, Yonghong Yan
- Abstract summary: Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR)
Most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks.
We propose a novel pre-training paradigm called wav2vec-S, which uses task-specific semi-supervised pre-training to bridge the gap between task-agnostic pre-training and task-specific downstream fine-tuning.
- Score: 44.347739529374124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training has dramatically improved the performance of
automatic speech recognition (ASR). However, most existing self-supervised
pre-training approaches are task-agnostic, i.e., they can be applied to various
downstream tasks. This leaves a gap between the task-agnostic pre-training and
the task-specific downstream fine-tuning, which may degrade downstream
performance. In this work, we propose a novel pre-training paradigm called
wav2vec-S, where we use task-specific semi-supervised pre-training to bridge
this gap. Specifically, the semi-supervised pre-training is conducted on the
basis of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR
show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase
in pre-training time but significantly improves ASR performance on
in-domain, cross-domain and cross-lingual datasets. The average relative WER
reductions are 26.3% and 6.3% for 1h and 10h fine-tuning, respectively.
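A note on the headline numbers and the training recipe: "relative WER reduction" here is the standard (WER_baseline - WER_wav2vec-S) / WER_baseline, averaged over the reported datasets. The abstract does not spell out the semi-supervised objective; one plausible reading of "semi-supervised pre-training on the basis of self-supervised pre-training" is a continued pre-training stage that mixes a wav2vec 2.0-style self-supervised loss on unlabeled audio with a supervised loss (e.g. CTC) on the small labeled set, before the usual downstream fine-tuning. The PyTorch sketch below illustrates only that reading; the tiny encoder, the frame-reconstruction stand-in for the contrastive loss, and the weighting factor `alpha` are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of the wav2vec-S recipe as read from the abstract:
# start from a self-supervised (wav2vec 2.0-style) encoder, then continue
# pre-training with a semi-supervised objective that mixes an unsupervised
# loss on unlabeled audio with a supervised CTC loss on a small labeled set,
# before the usual 1h/10h fine-tuning. All components are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyEncoder(nn.Module):
    """Stand-in for a pre-trained wav2vec 2.0-style encoder (features -> hidden states)."""

    def __init__(self, feat_dim: int = 80, hidden: int = 256, vocab: int = 32):
        super().__init__()
        self.backbone = nn.GRU(feat_dim, hidden, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)     # supervised branch
        self.ssl_head = nn.Linear(hidden, feat_dim)  # unsupervised branch

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.backbone(feats)
        return hidden


def semi_supervised_step(model, optim, labeled, unlabeled, alpha: float = 0.5) -> float:
    """One semi-supervised update: weighted sum of CTC and self-supervised losses."""
    feats_l, targets, in_lens, tgt_lens = labeled

    # Supervised CTC loss on the small labeled subset.
    log_probs = F.log_softmax(model.ctc_head(model(feats_l)), dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)

    # Self-supervised loss on unlabeled audio; wav2vec 2.0 uses a masked
    # contrastive objective, approximated here by simple frame reconstruction.
    ssl = F.mse_loss(model.ssl_head(model(unlabeled)), unlabeled)

    loss = alpha * ctc + (1.0 - alpha) * ssl
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()


if __name__ == "__main__":
    model = TinyEncoder()
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)
    labeled = (
        torch.randn(2, 120, 80),                  # acoustic features
        torch.randint(1, 32, (2, 20)),            # token targets (0 is the CTC blank)
        torch.full((2,), 120, dtype=torch.long),  # input lengths
        torch.full((2,), 20, dtype=torch.long),   # target lengths
    )
    unlabeled = torch.randn(2, 120, 80)
    print(semi_supervised_step(model, optim, labeled, unlabeled))
```
After such a semi-supervised stage, the model would be fine-tuned on the 1h or 10h labeled sets in the same way as a standard wav2vec 2.0 checkpoint.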
Related papers
- Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training.
We propose Switchable Sparse-Dense Learning (SSD)
SSD achieves comparable performance with identical model size and reduces pre-training costs.
arXiv Detail & Related papers (2024-10-04T13:53:33Z)
- SOAR: Self-supervision Optimized UAV Action Recognition with Efficient Object-Aware Pretraining [65.9024395309316]
We introduce a novel self-supervised pretraining algorithm for aerial footage captured by Unmanned Aerial Vehicles (UAVs).
We incorporate human object knowledge throughout the pretraining process to enhance UAV video pretraining efficiency and downstream action recognition performance.
arXiv Detail & Related papers (2024-09-26T21:15:22Z)
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR)
We show that a random projection quantizer can achieve downstream performance similar to that of wav2vec 2.0 while decreasing training time by more than a factor of two.
arXiv Detail & Related papers (2024-05-07T13:11:37Z)
- Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition [54.9235160379917]
Stable Distillation is a simple and novel approach for SSL-based continued pre-training.
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
arXiv Detail & Related papers (2023-12-20T06:02:12Z)
- Task-Customized Self-Supervised Pre-training with Scalable Dynamic Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose compression for on-demand compute reduction for wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on the 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that, using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications [1.3800173438685746]
We study the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases.
We benchmark the proposed models on four challenging ATC test sets.
We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
arXiv Detail & Related papers (2022-03-31T06:10:42Z)
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z)
- On Scaling Contrastive Representations for Low-Resource Speech Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
- Joint Masked CPC and CTC Training for ASR [29.41599824919278]
We demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data.
We show that this joint training method directly optimizes performance for the downstream ASR task using unsupervised data.
arXiv Detail & Related papers (2020-10-30T20:28:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.