FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition
- URL: http://arxiv.org/abs/2601.19919v1
- Date: Thu, 08 Jan 2026 08:05:30 GMT
- Title: FastWhisper: Adaptive Self-knowledge Distillation for Real-time Automatic Speech Recognition
- Authors: Junseok Lee, Nahoon Kim, Sangyong Lee, Chang-Jae Chun,
- Abstract summary: We propose adaptive self-knowledge distillation, which reduces the dependence on the teacher model to improve the self-training capacity. FastWhisper achieves a word error rate 1.07% lower than that of the teacher model Whisper, and its inference is 5 times faster.
- Score: 3.489980912925397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is one of the most effective methods for model compression. Previous studies have focused on having the student model effectively learn the predictive distribution of the teacher model. However, during training, the student model may inherit the shortcomings of the teacher model, which can degrade its generalization capacity. To mitigate this issue, we propose adaptive self-knowledge distillation (ASKD), which dynamically reduces the student's dependence on the teacher model to improve its self-training capacity, and applies self-knowledge distillation to improve the generalization capacity of the student model. We further distill the Whisper model into a smaller variant, called FastWhisper. In our post-training setting, FastWhisper achieved a word error rate 1.07% lower than that of the teacher model Whisper, and its inference was 5 times faster.
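The abstract describes reducing the student's dependence on the teacher dynamically during training but does not give the loss. The following is only a minimal sketch of one generic way to do that: a standard distillation loss whose teacher weight decays over training. It is not the paper's ASKD formulation. The linear schedule, the function names (`teacher_weight`, `distillation_loss`), and all default values are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the ground-truth class."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def teacher_weight(step, total_steps, start=0.9, end=0.1):
    """Hypothetical linear schedule: the teacher's weight shrinks as training proceeds."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def distillation_loss(student_logits, teacher_logits, target_index,
                      step, total_steps, temperature=2.0):
    """Mix hard-label loss with teacher matching, weighted by the schedule (assumed form)."""
    alpha = teacher_weight(step, total_steps)
    hard = cross_entropy(softmax(student_logits), target_index)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The temperature**2 factor is the usual scaling for softened targets.
    return (1.0 - alpha) * hard + alpha * (temperature ** 2) * soft


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [0.0, 3.0, 0.0]  # a teacher that disagrees with the label
    for step in (0, 50, 100):
        print(step, distillation_loss(student, teacher, 0, step, 100))
```

With a teacher that disagrees with the ground-truth label, the printed loss shrinks as the step count grows, because the schedule shifts weight from matching the teacher toward the hard-label term. An adaptive method would presumably set this weight from training signals rather than from a fixed schedule.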
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval [49.01637233471453]
Previous knowledge-distillation-based methods for efficient image retrieval employ a lightweight network as the student model for fast inference.
We propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity.
Our method has superior inference speed and accuracy, e.g., on the VeRi-776 dataset, given the ResNet101 as a teacher.
arXiv Detail & Related papers (2023-03-16T11:09:22Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model in the whole distillation.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Autoregressive Knowledge Distillation through Imitation Learning [70.12862707908769]
We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain 1.4 to 4.8 BLEU/ROUGE points higher than those trained from scratch, while increasing inference speed by up to 14 times in comparison to the teacher model.
arXiv Detail & Related papers (2020-09-15T17:43:02Z)
- Real-time Policy Distillation in Deep Reinforcement Learning [11.026828277064293]
Policy distillation is an effective way to transfer control policies from a larger network to a smaller untrained network.
Existing approaches are computationally inefficient, resulting in a long distillation time.
We propose a new distillation mechanism, called real-time policy distillation, in which training the teacher model and distilling the policy to the student model occur simultaneously.
arXiv Detail & Related papers (2019-12-29T11:10:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.