Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- URL: http://arxiv.org/abs/2110.04484v1
- Date: Sat, 9 Oct 2021 07:09:22 GMT
- Title: Wav2vec-S: Semi-Supervised Pre-Training for Speech Recognition
- Authors: Han Zhu, Li Wang, Ying Hou, Jindong Wang, Gaofeng Cheng, Pengyuan
Zhang, Yonghong Yan
- Abstract summary: Self-supervised pre-training has dramatically improved the performance of automatic speech recognition (ASR).
Most existing self-supervised pre-training approaches are task-agnostic, i.e., they can be applied to various downstream tasks.
We propose a novel pre-training paradigm called wav2vec-S, which uses task-specific semi-supervised pre-training to bridge the gap between task-agnostic pre-training and task-specific fine-tuning.
- Score: 44.347739529374124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training has dramatically improved the performance of
automatic speech recognition (ASR). However, most existing self-supervised
pre-training approaches are task-agnostic, i.e., they can be applied to various
downstream tasks. This leaves a gap between the task-agnostic pre-training and
the task-specific downstream fine-tuning, which may degrade downstream
performance. In this work, we propose a novel pre-training paradigm called
wav2vec-S, where we use task-specific semi-supervised pre-training to bridge
this gap. Specifically, the semi-supervised pre-training is conducted on the
basis of self-supervised pre-training such as wav2vec 2.0. Experiments on ASR
show that, compared to wav2vec 2.0, wav2vec-S requires only a marginal increase
in pre-training time but significantly improves ASR performance on
in-domain, cross-domain and cross-lingual datasets. The average relative WER
reductions are 26.3% and 6.3% for 1h and 10h fine-tuning, respectively.
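The abstract describes continuing pre-training from a self-supervised checkpoint while adding a task-specific supervised signal on labeled data. A minimal sketch of that idea is a weighted combination of the two losses; the function name, the specific losses (contrastive SSL, CTC), and the weighting scheme below are illustrative assumptions, not the paper's exact formulation.

```python
def semi_supervised_loss(ssl_loss: float, supervised_loss: float, alpha: float = 0.5) -> float:
    """Hypothetical semi-supervised pre-training objective.

    Mixes the self-supervised loss (e.g. the wav2vec 2.0 contrastive loss
    on unlabeled audio) with a supervised ASR loss (e.g. CTC on the small
    labeled set). alpha controls how much weight the supervised task gets.
    """
    return alpha * supervised_loss + (1.0 - alpha) * ssl_loss


# Example: equal weighting of the two objectives for one training step.
total = semi_supervised_loss(ssl_loss=2.0, supervised_loss=4.0, alpha=0.5)
print(total)
```

In practice, each training batch would draw unlabeled audio for the self-supervised term and labeled audio for the supervised term, so the model stays close to its self-supervised initialization while adapting toward the downstream ASR task.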
Related papers
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR)
We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two.
arXiv Detail & Related papers (2024-05-07T13:11:37Z) - Efficient Stagewise Pretraining via Progressive Subnetworks [55.65819977062729]
We propose an alternative framework, progressive subnetwork training, that maintains the full model throughout training but trains only subnetworks within the model at each step.
RaPTr achieves better pre-training loss for BERT and UL2 language models while requiring 20-33% fewer FLOPs compared to standard training, and is competitive or better than other efficient training methods.
arXiv Detail & Related papers (2024-02-08T18:49:09Z) - Revisiting the Power of Prompt for Visual Tuning [50.11465784194896]
This study explores how the correlation between prompts and patch tokens evolves during proficient training.
Inspired by the observation that the prompt tokens tend to share high mutual information with patch tokens, we propose initializing prompts with downstream token prototypes.
Our method significantly advances the adaptation for self-supervised pretraining, achieving impressive task performance gains of at least 10% to 30%.
arXiv Detail & Related papers (2024-02-04T07:49:02Z) - Stable Distillation: Regularizing Continued Pre-training for
Low-Resource Automatic Speech Recognition [54.9235160379917]
Stable Distillation is a simple and novel approach for SSL-based continued pre-training.
It boosts ASR performance in the target domain where both labeled and unlabeled data are limited.
arXiv Detail & Related papers (2023-12-20T06:02:12Z) - Task-Customized Self-Supervised Pre-training with Scalable Dynamic
Routing [76.78772372631623]
A common practice for self-supervised pre-training is to use as much data as possible.
For a specific downstream task, however, involving irrelevant data in pre-training may degrade the downstream performance.
It is burdensome and infeasible to use different downstream-task-customized datasets in pre-training for different tasks.
arXiv Detail & Related papers (2022-05-26T10:49:43Z) - On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose stochastic compression for on-demand compute reduction for wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z) - How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An
Extensive Benchmark on Air Traffic Control Communications [1.3800173438685746]
We study the impact on performance when the data substantially differs between the pre-training and downstream fine-tuning phases.
We benchmark the proposed models on four challenging ATC test sets.
We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours.
arXiv Detail & Related papers (2022-03-31T06:10:42Z) - Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z) - On Scaling Contrastive Representations for Low-Resource Speech
Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z) - Joint Masked CPC and CTC Training for ASR [29.41599824919278]
We demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data.
We show that this joint training method directly optimizes performance for the downstream ASR task while using unsupervised data.
arXiv Detail & Related papers (2020-10-30T20:28:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.