SKILL: Similarity-aware Knowledge distILLation for Speech
Self-Supervised Learning
- URL: http://arxiv.org/abs/2402.16830v1
- Date: Mon, 26 Feb 2024 18:56:42 GMT
- Title: SKILL: Similarity-aware Knowledge distILLation for Speech
Self-Supervised Learning
- Authors: Luca Zampierin, Ghouthi Boukli Hacene, Bac Nguyen, Mirco Ravanelli
- Abstract summary: We introduce SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual arbitrarily selected layers within the teacher network.
Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M-parameter model class.
- Score: 14.480769476843886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has achieved remarkable success across various
speech-processing tasks. To enhance its efficiency, previous works often
leverage compression techniques. A notable recent attempt is
DPHuBERT, which applies joint knowledge distillation (KD) and structured
pruning to learn a significantly smaller SSL model. In this paper, we
contribute to this research domain by introducing SKILL, a novel method that
conducts distillation across groups of layers instead of distilling individual
arbitrarily selected layers within the teacher network. The identification of
the layers to distill is achieved through a hierarchical clustering procedure
applied to layer similarity measures. Extensive experiments demonstrate that
our distilled version of WavLM Base+ not only outperforms DPHuBERT but also
achieves state-of-the-art results in the 30M-parameter model class across
several SUPERB tasks.
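The grouping step can be made concrete with a short sketch. The following is not the authors' code: the abstract names only "layer similarity measures" and "hierarchical clustering", so linear CKA as the similarity measure and average-linkage agglomerative clustering are assumptions here, as is forming distillation targets by averaging the hidden states within each group.

```python
# Minimal sketch of the layer-grouping idea described in the abstract (not the
# authors' code). Assumptions: pairwise layer similarity is measured with linear
# CKA on a batch of hidden states, groups are formed with average-linkage
# hierarchical clustering, and the number of groups is a free choice.
import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def linear_cka(x, y):
    """Linear CKA between two (frames x dim) layer representations."""
    x = x - x.mean(0, keepdim=True)
    y = y - y.mean(0, keepdim=True)
    hsic = (x.T @ y).norm() ** 2
    return (hsic / ((x.T @ x).norm() * (y.T @ y).norm())).item()


def group_teacher_layers(hidden_states, num_groups=4):
    """hidden_states: list of (frames x dim) tensors, one per teacher layer.
    Returns layer-index groups obtained by hierarchical clustering."""
    n = len(hidden_states)
    sim = torch.ones(n, n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(hidden_states[i], hidden_states[j])
    # Turn similarities into a condensed distance matrix and cluster.
    dist = squareform((1.0 - sim).clamp(min=0).numpy(), checks=False)
    labels = fcluster(linkage(dist, method="average"), t=num_groups, criterion="maxclust")
    return [[i for i in range(n) if labels[i] == g] for g in sorted(set(labels))]


def group_targets(hidden_states, groups):
    """Average the hidden states inside each group to form distillation targets."""
    return [torch.stack([hidden_states[i] for i in g]).mean(0) for g in groups]
```

A DPHuBERT-style student could then regress each group-averaged target from one of its own layers; how SKILL actually pairs student layers with teacher groups is specified in the paper, not in this sketch.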
Related papers
- CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination [28.061239778773423]
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks.
CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources.
We introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model.
arXiv Detail & Related papers (2024-08-18T11:23:21Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Structural Knowledge Distillation for Object Detection [2.690687471762886]
We propose replacing the pixel-wise independent lp-norm with a loss based on the structural similarity index (SSIM).
Our method adds little computational overhead, is straightforward to implement, and significantly outperforms the standard lp-norms.
arXiv Detail & Related papers (2022-11-23T17:06:52Z)
- Weighted Ensemble Self-Supervised Learning [67.24482854208783]
Ensembling has proven to be a powerful technique for boosting model performance.
We develop a framework that permits data-dependent weighted cross-entropy losses.
Our method outperforms both across multiple evaluation metrics on ImageNet-1K.
arXiv Detail & Related papers (2022-11-18T02:00:17Z)
- RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation [24.951887361152988]
We propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model (a minimal sketch of this idea appears after this list).
We show that the proposed RAIL-KD approach considerably outperforms other state-of-the-art intermediate-layer KD methods in both performance and training time.
arXiv Detail & Related papers (2021-09-21T13:21:13Z)
- Improving Task-Agnostic BERT Distillation with Layer Mapping Search [43.7650740369353]
We show that layer-level supervision is crucial to the performance of the student BERT model.
In this paper, we propose to use the genetic algorithm (GA) to search for the optimal layer mapping automatically.
After obtaining the optimal layer mapping, we perform the task-agnostic BERT distillation with it on the whole corpus to build a compact student model.
arXiv Detail & Related papers (2020-12-11T06:29:58Z)
- Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
arXiv Detail & Related papers (2020-12-05T00:49:14Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), while A aids this process by learning the residual error between them (see the sketch after this list).
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
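For the RAIL-KD entry above, here is a minimal sketch of the random layer-mapping idea as summarized there, not the authors' implementation; the per-step re-sampling, the shared linear projection, and the MSE loss are illustrative assumptions.

```python
# Minimal sketch of the RAIL-KD idea (not the authors' code): at each training
# step a fresh random subset of teacher layers is mapped one-to-one onto the
# student's intermediate layers and matched with an MSE loss.
import random
import torch
import torch.nn.functional as F


def rail_kd_loss(teacher_hidden, student_hidden, proj):
    """teacher_hidden / student_hidden: lists of (batch, frames, dim) tensors.
    proj: a linear layer mapping the student dim to the teacher dim (shared here)."""
    k = len(student_hidden)
    picked = sorted(random.sample(range(len(teacher_hidden)), k))  # random teacher layers
    loss = 0.0
    for t_idx, s_hid in zip(picked, student_hidden):
        loss = loss + F.mse_loss(proj(s_hid), teacher_hidden[t_idx].detach())
    return loss / k
```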
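For the Residual Knowledge Distillation entry above, here is a minimal sketch based only on the summary, not the authors' formulation: the student regresses the teacher's feature map, and an assistant is trained to predict the residual the student leaves behind.

```python
# Minimal sketch of the residual idea described in the RKD entry (not the
# authors' code); the MSE objectives and the assistant's input are assumptions.
import torch
import torch.nn.functional as F


def rkd_losses(f_teacher, f_student, assistant):
    """f_teacher, f_student: feature maps of identical shape.
    assistant: a small network predicting the residual from the student feature."""
    residual = (f_teacher - f_student).detach()          # what the student still misses
    loss_student = F.mse_loss(f_student, f_teacher.detach())
    loss_assistant = F.mse_loss(assistant(f_student.detach()), residual)
    return loss_student, loss_assistant
```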