Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition
- URL: http://arxiv.org/abs/2109.06870v1
- Date: Tue, 14 Sep 2021 17:58:09 GMT
- Title: Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition
- Authors: Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q. Weinberger, Yoav
Artzi
- Abstract summary: We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
- Score: 32.61769580342906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper is a study of performance-efficiency trade-offs in pre-trained
models for automatic speech recognition (ASR). We focus on wav2vec 2.0, and
formalize several architecture designs that influence both the model
performance and its efficiency. Putting together all our observations, we
introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model
architecture with significant improvements along both performance and
efficiency dimensions across a variety of training setups. For example, under
the 100h-960h semi-supervised setup on LibriSpeech, SEW achieves a 1.9x
inference speedup compared to wav2vec 2.0, with a 13.5% relative reduction in
word error rate. With a similar inference time, SEW reduces word error rate by
25-50% across different model sizes.
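As a rough illustration of how such a speedup claim can be checked, the sketch below times a forward pass of a wav2vec 2.0 checkpoint against a SEW checkpoint. It assumes the Hugging Face `transformers` and `torch` packages are installed; the two checkpoint identifiers are assumptions (both model families are published on the Hub, but verify the exact names), and random noise stands in for real 16 kHz audio.

```python
import time
import torch
from transformers import AutoModelForCTC

# Checkpoint names are assumptions -- verify them on the Hugging Face Hub.
BASELINE = "facebook/wav2vec2-base-960h"    # wav2vec 2.0 fine-tuned on 960h
EFFICIENT = "asapp/sew-mid-100k-ft-ls100h"  # SEW checkpoint (assumed name)

def mean_forward_time(name: str, audio_seconds: float = 10.0, runs: int = 20) -> float:
    """Average CPU forward-pass time on a random 16 kHz waveform."""
    model = AutoModelForCTC.from_pretrained(name).eval()
    wave = torch.randn(1, int(16_000 * audio_seconds))  # noise stand-in for audio
    with torch.inference_mode():
        model(wave)  # warm-up run
        start = time.perf_counter()
        for _ in range(runs):
            model(wave)
    return (time.perf_counter() - start) / runs

for name in (BASELINE, EFFICIENT):
    print(f"{name}: {mean_forward_time(name) * 1000:.0f} ms per 10 s clip")
```

A single-threaded CPU run of this kind should reproduce the direction, if not the exact magnitude, of the reported 1.9x speedup.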
Related papers
- Continued Pretraining for Domain Adaptation of Wav2vec2.0 in Automatic Speech Recognition for Elementary Math Classroom Settings [4.266613351203219]
We study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain.
We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%.
arXiv Detail & Related papers (2024-05-15T06:59:33Z)
- Task-Agnostic Structured Pruning of Speech Representation Models [18.555223754089905]
We propose a fine-grained attention head pruning method to compensate for the performance degradation caused by structured pruning.
Experiments on the SUPERB benchmark show that our model can achieve comparable performance to the dense model in multiple tasks.
arXiv Detail & Related papers (2023-06-02T09:11:06Z)
- ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
arXiv Detail & Related papers (2023-01-02T18:59:31Z)
- Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language [60.12197397018094]
data2vec is a learning objective that generalizes across several modalities.
We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations.
Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time.
arXiv Detail & Related papers (2022-12-14T22:13:11Z)
- Application of Knowledge Distillation to Multi-task Speech Representation Learning [2.0908300719428228]
Speech representation learning models use a large number of parameters; even the smallest version has 95 million parameters.
In this paper, we investigate the application of knowledge distillation to speech representation learning models followed by fine-tuning.
Our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation (a minimal sketch of this recipe follows the list below).
arXiv Detail & Related papers (2022-10-29T14:22:43Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose stochastic compression for on-demand compute reduction in wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on the 960h LibriSpeech dataset and fine-tuned on 10h of transcribed data show that, using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [78.92428622630861]
wav2vec 2.0 can be used for speech emotion recognition (SER).
Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented.
We show V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset.
We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations.
arXiv Detail & Related papers (2021-10-12T19:55:55Z)
- On Scaling Contrastive Representations for Low-Resource Speech Recognition [12.447872366013224]
We train a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework.
We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
arXiv Detail & Related papers (2021-02-01T13:58:02Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
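The knowledge distillation entry above describes compressing a speech representation model by roughly 75% and then fine-tuning the student. A minimal sketch of one distillation step follows, assuming an L2 objective on frame-level hidden states; the summary does not spell out the exact objective, so `ToyEncoder`, the layer widths, and the MSE target are illustrative stand-ins, not the authors' implementation. A real teacher/student pair (e.g. a full and a slimmed-down wav2vec 2.0 encoder) would drop in where the toy modules are.

```python
import torch
import torch.nn.functional as F

class ToyEncoder(torch.nn.Module):
    """Stand-in speech encoder: waveform (B, T) -> frame features (B, T', D)."""
    def __init__(self, dim: int):
        super().__init__()
        # A 320-sample stride mimics the ~20 ms frame rate of wav2vec-style encoders.
        self.conv = torch.nn.Conv1d(1, dim, kernel_size=320, stride=320)

    def forward(self, wave: torch.Tensor) -> torch.Tensor:
        return self.conv(wave.unsqueeze(1)).transpose(1, 2)

teacher = ToyEncoder(dim=768).eval()   # large, frozen teacher
student = ToyEncoder(dim=256)          # much smaller student
proj = torch.nn.Linear(256, 768)       # bridges the width mismatch
opt = torch.optim.Adam(
    list(student.parameters()) + list(proj.parameters()), lr=1e-4
)

wave = torch.randn(2, 16_000)          # two fake 1-second 16 kHz clips
with torch.no_grad():
    target = teacher(wave)             # (B, T', 768) distillation targets

# One distillation step: pull projected student frames toward teacher frames.
loss = F.mse_loss(proj(student(wave)), target)
opt.zero_grad()
loss.backward()
opt.step()
print(f"distillation loss: {loss.item():.4f}")
```

After distillation, the student would be fine-tuned on the downstream tasks as usual; only the distillation step is sketched here.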