Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition
- URL: http://arxiv.org/abs/2210.00169v1
- Date: Sat, 1 Oct 2022 02:23:00 GMT
- Title: Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition
- Authors: Jash Rathod, Nauman Dawalatabad, Shatrughan Singh, Dhananjaya Gowda
- Abstract summary: The limited memory bandwidth of smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to achieve smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
- Score: 7.450574974954803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The limited memory bandwidth of smart devices prompts the development of
smaller Automatic Speech Recognition (ASR) models. To obtain a smaller model, one
can employ model compression techniques. Knowledge distillation (KD) is a popular
model compression approach that has been shown to achieve a smaller model size
with relatively little degradation in model performance. In this approach,
knowledge is distilled from a trained, larger teacher model to a smaller student
model. Transducer-based models have recently been shown to perform well for
on-device streaming ASR, while conformer models are efficient at handling
long-term dependencies. Hence, in this work we employ a streaming transducer
architecture with a conformer encoder. We propose a multi-stage progressive
approach to compress the conformer transducer model using KD, progressively
updating the teacher model with the distilled student model at each stage. On
the standard LibriSpeech dataset, we achieve compression rates greater than 60%
without significant degradation in performance compared to the larger teacher
model.
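The core loop of this multi-stage approach, in which the distilled student of each stage is promoted to teacher for the next, smaller student, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feed-forward make_encoder stands in for the conformer-transducer encoder, the layer counts are arbitrary, and the temperature-scaled KL divergence on output distributions is just one common choice of distillation loss.

    # Minimal sketch of multi-stage progressive knowledge distillation.
    # The feed-forward "encoder" is a stand-in for the conformer-transducer
    # encoder, and the temperature-scaled KL loss is one common KD choice;
    # neither is claimed to match the paper's exact recipe.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def make_encoder(num_layers, dim=144, vocab=128):
        layers = [nn.Linear(80, dim), nn.ReLU()]
        for _ in range(num_layers - 1):
            layers += [nn.Linear(dim, dim), nn.ReLU()]
        return nn.Sequential(*layers, nn.Linear(dim, vocab))

    def distill(teacher, student, batches, temperature=2.0, lr=1e-3):
        """One compression stage: train `student` to match `teacher` outputs."""
        teacher.eval()
        opt = torch.optim.Adam(student.parameters(), lr=lr)
        for feats in batches:                     # feats: (batch, frames, 80)
            with torch.no_grad():
                t_logits = teacher(feats)
            s_logits = student(feats)
            loss = F.kl_div(
                F.log_softmax(s_logits / temperature, dim=-1),
                F.softmax(t_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()
        return student

    # Progressive multi-stage setup: the distilled student of each stage
    # becomes the teacher for the next, smaller student.
    batches = [torch.randn(4, 50, 80) for _ in range(10)]
    teacher = make_encoder(num_layers=16)         # large initial teacher
    for n in (12, 8, 6):                          # progressively smaller students
        student = make_encoder(num_layers=n)
        teacher = distill(teacher, student, batches)
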
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD matches or exceeds the performance of leading methods across model architectures and sizes while reducing training time by up to fourfold (see the sketch below).
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
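The online-module idea can be pictured roughly as a small trainable head on top of a frozen teacher, updated together with the student so the distillation target adapts during training. This is only a hedged sketch of that general concept; the module design, losses, and schedule below are assumptions, not the OKD recipe.

    # Rough sketch of online distillation with a small adapter module on a
    # frozen teacher (illustrative assumptions, not the OKD paper's method).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab = 100
    teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, vocab))
    online_module = nn.Linear(vocab, vocab)     # small trainable head on the teacher
    student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, vocab))

    for p in teacher.parameters():              # the large teacher stays frozen
        p.requires_grad_(False)

    opt = torch.optim.Adam(
        list(online_module.parameters()) + list(student.parameters()), lr=1e-3)

    for _ in range(5):                          # dummy training steps
        x = torch.randn(8, 64)
        y = torch.randint(0, vocab, (8,))
        t_logits = online_module(teacher(x))    # adapted (online) teacher output
        s_logits = student(x)
        # The online module learns the task; the student learns the task and
        # also matches the adapted teacher distribution.
        loss = (
            F.cross_entropy(t_logits, y)
            + F.cross_entropy(s_logits, y)
            + F.kl_div(F.log_softmax(s_logits, dim=-1),
                       F.softmax(t_logits.detach(), dim=-1),
                       reduction="batchmean")
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
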
- Tiny Models are the Computational Saver for Large Models [1.8350044465969415]
This paper introduces TinySaver, an early-exit-like dynamic model compression approach that employs tiny models to substitute for large models adaptively.
Our evaluation of this approach on ImageNet-1k classification demonstrates its potential to reduce the number of compute operations by up to 90%, with only negligible losses in performance (see the sketch below).
arXiv Detail & Related papers (2024-03-26T14:14:30Z)
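The "tiny model as saver" idea resembles a confidence-gated cascade: run a tiny model first and only fall back to the large model when the tiny model is unsure. The models and the confidence threshold below are illustrative assumptions, not the TinySaver configuration.

    # Confidence-gated cascade: a tiny model answers easy inputs, a large model
    # handles the rest (illustrative sketch, not the TinySaver implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    num_classes = 1000
    tiny = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, num_classes))
    large = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4096), nn.ReLU(),
                          nn.Linear(4096, num_classes))

    @torch.no_grad()
    def cascade_predict(x, threshold=0.8):
        probs = F.softmax(tiny(x), dim=-1)
        conf, pred = probs.max(dim=-1)
        unsure = conf < threshold                  # only these go to the large model
        if unsure.any():
            pred[unsure] = large(x[unsure]).argmax(dim=-1)
        return pred, unsure.float().mean().item()  # fraction escalated

    preds, escalated = cascade_predict(torch.randn(16, 3, 32, 32))
    print(f"escalated to large model: {escalated:.0%}")
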
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference (see the sketch below).
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
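TopK sparsification, referred to in the entry above, keeps only the k largest-magnitude entries of a tensor and zeroes the rest; applying it to the activations and gradients exchanged between model-parallel workers is the compression being studied. The helper below is a generic sketch; the keep ratios are purely illustrative assumptions.

    # Generic TopK sparsification of a tensor, as used to compress activations
    # or gradients exchanged between model-parallel workers (illustrative).
    import torch

    def topk_compress(t: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
        """Keep the `ratio` fraction of largest-magnitude entries, zero the rest."""
        flat = t.flatten()
        k = max(1, int(ratio * flat.numel()))
        _, idx = flat.abs().topk(k)
        out = torch.zeros_like(flat)
        out[idx] = flat[idx]
        return out.view_as(t)

    activations = torch.randn(4, 256)
    grads = torch.randn(4, 256)
    # The cited study finds gradients need milder compression than activations,
    # so here the gradient keeps a larger fraction of entries; the exact ratios
    # are illustrative only.
    sent_act = topk_compress(activations, ratio=0.1)
    sent_grad = topk_compress(grads, ratio=0.3)
    print(int((sent_act != 0).sum()), int((sent_grad != 0).sum()))
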
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed to improve inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models (see the sketch below).
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
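Co-training with co-distillation can be pictured as two peer models trained on the task loss while each also distills from the other. The symmetric KL terms and equal loss weights below are assumptions for illustration, not the CTCD objective.

    # Two peers trained jointly: each learns the task and distills from the
    # other (illustrative sketch, not the CTCD paper's exact objective).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab = 50
    model_a = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, vocab))
    model_b = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, vocab))
    opt = torch.optim.Adam(
        list(model_a.parameters()) + list(model_b.parameters()), lr=1e-3)

    def kd(student_logits, teacher_logits):
        return F.kl_div(F.log_softmax(student_logits, dim=-1),
                        F.softmax(teacher_logits.detach(), dim=-1),
                        reduction="batchmean")

    for _ in range(5):                       # dummy steps on random data
        x = torch.randn(8, 32)
        y = torch.randint(0, vocab, (8,))
        logits_a, logits_b = model_a(x), model_b(x)
        loss = (F.cross_entropy(logits_a, y) + F.cross_entropy(logits_b, y)
                + kd(logits_a, logits_b) + kd(logits_b, logits_a))
        opt.zero_grad()
        loss.backward()
        opt.step()
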
- Ultra Fast Speech Separation Model with Teacher Student Learning [44.71171732510265]
An ultra-fast Transformer model is proposed to achieve better performance and efficiency with teacher-student learning (T-S learning).
Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation.
arXiv Detail & Related papers (2022-04-27T09:02:45Z)
- A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes [54.83802872236367]
We propose a dynamic cascaded encoder Automatic Speech Recognition (ASR) model, which unifies models for different deployment scenarios.
The proposed large-medium model is 30% smaller and reduces power consumption by 33% compared to the baseline cascaded encoder model.
The triple-size model that unifies the large, medium, and small models achieves a 37% total size reduction with minimal quality loss (see the sketch below).
arXiv Detail & Related papers (2022-04-13T04:15:51Z)
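A cascaded encoder that serves several deployment sizes can be sketched as a shared base encoder plus optional extra encoder blocks enabled per "size" at inference time. The block counts and dimensions here are illustrative assumptions, not the paper's configuration.

    # One model, several effective sizes: a shared base encoder plus optional
    # cascaded blocks selected at inference time (illustrative sketch only).
    import torch
    import torch.nn as nn

    class CascadedEncoder(nn.Module):
        def __init__(self, dim=256, base_blocks=4, extra_blocks=8):
            super().__init__()
            def block():
                return nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
            self.base = nn.Sequential(*[block() for _ in range(base_blocks)])
            self.extra = nn.ModuleList([block() for _ in range(extra_blocks)])

        def forward(self, x, size="large"):
            # "small" uses only the base; "medium"/"large" run more cascaded blocks.
            n_extra = {"small": 0, "medium": 4, "large": 8}[size]
            h = self.base(x)
            for blk in self.extra[:n_extra]:
                h = blk(h)
            return h

    enc = CascadedEncoder()
    x = torch.randn(2, 100, 256)             # (batch, frames, features)
    small_out = enc(x, size="small")         # cheapest path for low-power devices
    large_out = enc(x, size="large")         # full path when compute allows
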
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model for the whole distillation, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from multiple teacher models can lead to better performance of the distilled student models (see the sketch below).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
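Learning differentially from several teachers amounts to weighting each teacher's distillation signal per example instead of uniformly. The confidence-based weights below are a simple stand-in; the paper learns the selection with reinforcement learning, which is not reproduced here.

    # Per-example weighting of several teachers in knowledge distillation.
    # The weights come from teacher confidence on each example; the cited paper
    # instead learns the selection policy with RL (not reproduced here).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    vocab = 20
    teachers = [nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, vocab))
                for _ in range(3)]
    student = nn.Linear(16, vocab)
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)

    x = torch.randn(8, 16)
    y = torch.randint(0, vocab, (8,))

    with torch.no_grad():
        t_logits = torch.stack([t(x) for t in teachers])          # (T, B, vocab)
        t_probs = F.softmax(t_logits, dim=-1)
        # Weight each teacher per example by its confidence in the true label.
        conf = t_probs.gather(
            -1, y.view(1, -1, 1).expand(len(teachers), -1, 1)).squeeze(-1)  # (T, B)
        weights = F.softmax(conf, dim=0)                           # normalize over teachers

    s_logits = student(x)
    s_log_probs = F.log_softmax(s_logits, dim=-1)                  # (B, vocab)
    kd_per_teacher = F.kl_div(s_log_probs.unsqueeze(0).expand_as(t_probs),
                              t_probs, reduction="none").sum(-1)   # (T, B)
    loss = F.cross_entropy(s_logits, y) + (weights * kd_per_teacher).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
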
- Self-Supervised GAN Compression [32.21713098893454]
We show that a standard model compression technique, weight pruning, cannot be applied to GANs using existing methods.
We then develop a self-supervised compression technique which uses the trained discriminator to supervise the training of a compressed generator.
We show that this framework maintains compelling performance at high degrees of sparsity, can easily be applied to new tasks and models, and enables meaningful comparisons between different pruning granularities (see the sketch below).
arXiv Detail & Related papers (2020-07-03T04:18:54Z)
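Using the trained discriminator to supervise a compressed generator can be sketched as follows: freeze the original generator and discriminator, then train a smaller generator with an adversarial signal from the frozen discriminator plus a reconstruction term against the original generator's outputs for the same latent code. The tiny MLP modules and loss weights are illustrative assumptions; the paper's exact objective may differ.

    # Frozen discriminator supervising a smaller generator (illustrative sketch
    # of the self-supervised GAN-compression idea; weights are assumptions).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    latent = 32
    big_gen = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(),
                            nn.Linear(512, 784), nn.Tanh())
    small_gen = nn.Sequential(nn.Linear(latent, 64), nn.ReLU(),
                              nn.Linear(64, 784), nn.Tanh())
    disc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))

    for m in (big_gen, disc):                 # both pretrained nets stay frozen
        for p in m.parameters():
            p.requires_grad_(False)

    opt = torch.optim.Adam(small_gen.parameters(), lr=2e-4)

    for _ in range(5):                        # dummy compression-training steps
        z = torch.randn(16, latent)
        fake_small = small_gen(z)
        # Adversarial term: the frozen discriminator should score the compressed
        # generator's samples as "real"; reconstruction term: match the original
        # generator's output for the same latent code.
        adv = F.binary_cross_entropy_with_logits(disc(fake_small),
                                                 torch.ones(16, 1))
        recon = F.mse_loss(fake_small, big_gen(z))
        loss = adv + 10.0 * recon
        opt.zero_grad()
        loss.backward()
        opt.step()
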
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)