Application of Knowledge Distillation to Multi-task Speech
Representation Learning
- URL: http://arxiv.org/abs/2210.16611v2
- Date: Fri, 19 May 2023 17:16:53 GMT
- Title: Application of Knowledge Distillation to Multi-task Speech
Representation Learning
- Authors: Mine Kerpicci, Van Nguyen, Shuhua Zhang, Erik Visser
- Abstract summary: Speech representation learning models use a large number of parameters; even the smallest version has 95 million.
In this paper, we investigate the application of knowledge distillation to speech representation learning models, followed by fine-tuning.
Our approach results in a nearly 75% reduction in model size while suffering only 0.1% accuracy and 0.9% equal error rate degradation.
- Score: 2.0908300719428228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model architectures such as wav2vec 2.0 and HuBERT have been proposed to
learn speech representations from audio waveforms in a self-supervised manner.
When they are combined with downstream tasks such as keyword spotting and
speaker verification, they provide state-of-the-art performance. However, these
models use a large number of parameters, the smallest version of which has 95
million parameters. This constitutes a challenge for edge AI device
deployments. In this paper, we investigate the application of knowledge
distillation to speech representation learning (SRL) models followed by joint
fine-tuning with multiple downstream voice-activated tasks. In our experiments
on two such tasks, our approach results in nearly 75% reduction in model size
while suffering only 0.1% accuracy and 0.9% equal error rate degradation
compared to the full-size model. In addition, we show that fine-tuning the SRL
models results in a significant performance boost compared to using frozen SRL
models.
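The abstract only outlines the pipeline, so the following is a minimal PyTorch sketch of what distillation from a large SRL teacher into a smaller student, followed by joint fine-tuning on two voice-activated tasks (keyword spotting and speaker verification), could look like. The encoder sizes, the L1-plus-cosine distillation objective, the mean-pooling heads, and the equal loss weights are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: feature-level distillation from a (frozen) teacher SRL model into a
# smaller student, then joint multi-task fine-tuning with two task heads.
# All dimensions, losses, and head designs are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StudentEncoder(nn.Module):
    """Compact stand-in for a distilled SRL student (sizes are illustrative)."""

    def __init__(self, dim=384, num_layers=4):
        super().__init__()
        # Simple waveform front-end followed by a small Transformer stack.
        self.frontend = nn.Conv1d(1, dim, kernel_size=10, stride=5)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, wav):                          # wav: (batch, samples)
        feats = self.frontend(wav.unsqueeze(1))      # (batch, dim, frames)
        return self.encoder(feats.transpose(1, 2))   # (batch, frames, dim)


def distillation_loss(student_feat, teacher_feat, proj):
    """Frame-level feature matching: L1 plus cosine distance to the projected teacher."""
    target = proj(teacher_feat)                      # map teacher dim -> student dim
    l1 = F.l1_loss(student_feat, target)
    cos = 1.0 - F.cosine_similarity(student_feat, target, dim=-1).mean()
    return l1 + cos


def joint_finetune_loss(feat, kws_head, spk_head, kws_labels, spk_labels):
    """Multi-task fine-tuning: keyword-spotting CE plus speaker-classification CE
    (a common proxy objective for speaker verification)."""
    pooled = feat.mean(dim=1)                        # mean pooling over frames
    return (F.cross_entropy(kws_head(pooled), kws_labels)
            + F.cross_entropy(spk_head(pooled), spk_labels))  # equal weighting assumed
```

In such a setup, the teacher (e.g. a full-size wav2vec 2.0 or HuBERT model) stays frozen during the distillation phase, while the joint fine-tuning phase updates the student encoder together with both task heads, mirroring the "distillation followed by joint fine-tuning" pipeline described in the abstract.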
Related papers
- STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models [10.07318014676215]
We propose to compress speech SSL models by distilling speech temporal relation (STaR).
Our model distilled from HuBERT BASE achieves an overall score of 79.8 on SUPERB benchmark, the best performance among models with up to 27 million parameters.
arXiv Detail & Related papers (2023-12-14T15:37:37Z) - Task-Agnostic Structured Pruning of Speech Representation Models [18.555223754089905]
We propose a fine-grained attention head pruning method to compensate for the performance degradation.
Experiments on the SUPERB benchmark show that our model can achieve comparable performance to the dense model in multiple tasks.
arXiv Detail & Related papers (2023-06-02T09:11:06Z) - MiniSUPERB: Lightweight Benchmark for Self-supervised Speech Models [90.99663022952498]
SUPERB was proposed to evaluate the generalizability of self-supervised learning (SSL) speech models across various tasks.
However, SUPERB incurs high computational costs due to its large datasets and diverse tasks.
We introduce MiniSUPERB, a lightweight benchmark that efficiently evaluates SSL speech models, achieving results comparable to SUPERB at significantly lower computational cost.
arXiv Detail & Related papers (2023-05-30T13:07:33Z) - A Light Weight Model for Active Speaker Detection [7.253335671577093]
We construct a lightweight active speaker detection architecture by reducing input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent unit (GRU) with low computational complexity for cross-modal modeling.
Experimental results on the AVA-ActiveSpeaker dataset show that our framework achieves competitive mAP performance (94.1% vs. 94.2%).
Our framework also performs well on the Columbia dataset showing good robustness.
arXiv Detail & Related papers (2023-03-08T08:40:56Z) - LightHuBERT: Lightweight and Configurable Speech Representation Learning
with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks while reducing parameters by 29%.
arXiv Detail & Related papers (2022-03-29T14:20:55Z) - MoEfication: Conditional Computation of Transformer Models for Efficient
Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this also leads to huge computational cost.
We explore accelerating large-model inference through conditional computation based on the sparse-activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version of equal model size, namely MoEfication (see the minimal routing sketch after this list).
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Performance-Efficiency Trade-offs in Unsupervised Pre-training for
Speech Recognition [32.61769580342906]
We focus on wav2vec 2.0, and formalize several architecture designs that influence both the model performance and its efficiency.
We introduce SEW (Squeezed and Efficient Wav2vec), a pre-trained model architecture with significant improvements along both performance and efficiency dimensions.
arXiv Detail & Related papers (2021-09-14T17:58:09Z) - Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z) - Simplified Self-Attention for Transformer-based End-to-End Speech
Recognition [56.818507476125895]
We propose a simplified self-attention (SSAN) layer which employs an FSMN memory block instead of projection layers to form query and key vectors.
We evaluate the SSAN-based and the conventional SAN-based transformers on the public AISHELL-1, internal 1000-hour and 20,000-hour large-scale Mandarin tasks.
arXiv Detail & Related papers (2020-05-21T04:55:59Z) - Characterizing Speech Adversarial Examples Using Self-Attention U-Net
Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_{At}$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)