Shrinking Bigfoot: Reducing wav2vec 2.0 footprint
- URL: http://arxiv.org/abs/2103.15760v2
- Date: Thu, 1 Apr 2021 14:57:08 GMT
- Title: Shrinking Bigfoot: Reducing wav2vec 2.0 footprint
- Authors: Zilun Peng, Akshay Budhkar, Ilana Tuil, Jason Levy, Parinaz Sobhani,
Raphael Cohen, Jumana Nassour
- Abstract summary: Wav2vec 2.0 is a state-of-the-art speech recognition model.
The latency of wav2vec 2.0 will be a bottleneck in production.
We explore multiple model compression methods borrowed from the domain of large language models.
- Score: 4.708858512006221
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Wav2vec 2.0 is a state-of-the-art speech recognition model which maps speech
audio waveforms into latent representations. The largest version of wav2vec 2.0
contains 317 million parameters. Hence, the inference latency of wav2vec 2.0
will be a bottleneck in production, leading to high costs and a significant
environmental footprint. To improve wav2vec's applicability to a production
setting, we explore multiple model compression methods borrowed from the domain
of large language models. Using a teacher-student approach, we distilled the
knowledge from the original wav2vec 2.0 model into a student model, which is 2
times faster and 4.8 times smaller than the original model. This speedup and
size reduction come at the cost of only a 7% degradation in word error rate
(WER). Our quantized model is 3.6 times smaller than the original model, with
only a 0.1% degradation in WER. To the best of our knowledge, this is the first
work that compresses wav2vec 2.0.
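The teacher-student distillation described in the abstract can be sketched with a standard temperature-scaled KL objective. The following is a generic NumPy illustration of knowledge distillation, not the authors' actual training code; the loss form and temperature value are assumptions:

```python
import numpy as np

def softmax(x, T=1.0):
    # Numerically stable softmax of x / T along the last axis.
    z = np.exp((x - x.max(axis=-1, keepdims=True)) / T)
    return z / z.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    p_t = softmax(teacher_logits, T)
    log_ratio = np.log(p_t) - np.log(softmax(student_logits, T))
    return float((p_t * log_ratio).sum(axis=-1).mean() * T ** 2)
```

In a typical distillation setup, the student is trained on a weighted sum of this soft-target loss and the ordinary supervised task loss, with the teacher's logits computed without gradient updates.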
Related papers
- CPT-Boosted Wav2vec2.0: Towards Noise Robust Speech Recognition for Classroom Environments [4.266613351203219]
We study the efficacy of continued pretraining (CPT) in adapting Wav2vec2.0 to the classroom domain.
We show that CPT is a powerful tool in that regard and reduces the Word Error Rate (WER) of Wav2vec2.0-based models by upwards of 10%.
arXiv Detail & Related papers (2024-09-13T19:14:18Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [65.30937248905958]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens.
We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain.
WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection [57.537583869961885]
Self-supervised speech models are a rapidly developing research topic in fake audio detection.
We apply low-rank adaptation (LoRA) to the wav2vec2 model, freezing the pre-trained model weights and injecting a trainable rank-decomposition matrix into each layer of the transformer architecture.
Compared with fine-tuning with Adam on the wav2vec2 model containing 317M training parameters, LoRA achieved similar performance by reducing the number of trainable parameters by 198 times.
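The rank-decomposition idea behind LoRA can be sketched as follows: a frozen weight matrix is augmented with a trainable low-rank update. This is a minimal NumPy illustration under common conventions (zero-initializing one factor, an `alpha / r` scaling), not the paper's actual implementation:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update B @ A with rank r << d."""

    def __init__(self, W, r=2, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W  # frozen pre-trained weight, shape (d_out, d_in)
        d_out, d_in = W.shape
        self.A = rng.normal(0.0, 0.01, size=(r, d_in))  # trainable
        self.B = np.zeros((d_out, r))  # trainable; zero-init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T + (alpha/r) * x A^T B^T ; only A and B would receive gradients.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

With rank r, the trainable parameter count is r * (d_in + d_out) instead of d_in * d_out, which is where the large reduction in trainable parameters comes from.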
arXiv Detail & Related papers (2023-06-09T01:43:41Z)
- Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition [57.31233839489528]
This paper explores approaches to integrate domain adapted SSL pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition.
arXiv Detail & Related papers (2023-02-28T13:39:17Z)
- On-demand compute reduction with stochastic wav2vec 2.0 [63.22845151306881]
We propose a compression approach for on-demand compute reduction in wav2vec 2.0 (W2V2) models.
Our results for models pre-trained on 960h Librispeech dataset and fine-tuned on 10h of transcribed data show that using the same model, we get a smooth trade-off between word error rate (WER) and inference time.
arXiv Detail & Related papers (2022-04-25T19:25:46Z)
- Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset [0.0]
This paper introduces a deep-learning-based emotion recognition model for Arabic speech dialogues.
The developed model employs state-of-the-art audio representations, including wav2vec2.0 and HuBERT.
Our experimental results surpass previously reported outcomes.
arXiv Detail & Related papers (2021-10-09T00:58:12Z)
- Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z)
- Wav2vec-C: A Self-supervised Model for Speech Representation Learning [40.47940210640496]
Wav2vec-C is a representation learning technique combining elements from wav2vec 2.0 and VQ-VAE.
The proposed self-supervised model is trained on 10k hours of unlabeled data and fine-tuned with 1k hours of labeled data.
arXiv Detail & Related papers (2021-03-09T16:44:45Z)
- Exploring wav2vec 2.0 on speaker verification and language identification [9.047596226273495]
Wav2vec 2.0 is a proposed self-supervised framework for speech representation learning.
In this work, we attempt to extend wav2vec 2.0 to speaker verification and language identification.
For speaker verification, we obtain a new state-of-the-art result, Equal Error Rate (EER) of 3.61% on the VoxCeleb1 dataset.
For language identification, we obtain an EER of 12.02% on 1 second condition and an EER of 3.47% on full-length condition of the AP17-OLR dataset.
arXiv Detail & Related papers (2020-12-11T08:22:23Z)
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations [51.25118580050847]
We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods.
wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned.
arXiv Detail & Related papers (2020-06-20T02:35:02Z)
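The contrastive task described in the wav2vec 2.0 entry above can be sketched as an InfoNCE-style objective: given a context vector at a masked position, identify the true quantized latent among distractors by cosine similarity. This is a minimal NumPy sketch; the temperature value and distractor handling are assumptions, not the paper's exact formulation:

```python
import numpy as np

def contrastive_loss(context, q_positive, distractors, temperature=0.1):
    """Pick the true quantized latent among distractors via cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = [q_positive] + list(distractors)
    sims = np.array([cos(context, q) for q in candidates]) / temperature
    sims -= sims.max()  # numerical stability before exponentiation
    # Cross-entropy of the softmax over candidates, with index 0 as the target.
    return float(-np.log(np.exp(sims[0]) / np.exp(sims).sum()))
```

The loss is small when the context vector aligns with the true quantized latent and large when it aligns with a distractor instead, which is what drives the representations learned during pre-training.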
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.