DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech
Models
- URL: http://arxiv.org/abs/2305.17651v1
- Date: Sun, 28 May 2023 07:09:33 GMT
- Title: DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech
Models
- Authors: Yifan Peng, Yui Sudo, Shakeel Muhammad, Shinji Watanabe
- Abstract summary: Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment.
We propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning.
- Score: 34.464301065191336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has achieved notable success in many speech
processing tasks, but the large model size and heavy computational cost hinder
the deployment. Knowledge distillation trains a small student model to mimic
the behavior of a large teacher model. However, the student architecture
usually needs to be manually designed and will remain fixed during training,
which requires prior knowledge and can lead to suboptimal performance. Inspired
by recent success of task-specific structured pruning, we propose DPHuBERT, a
novel task-agnostic compression method for speech SSL based on joint
distillation and pruning. Experiments on SUPERB show that DPHuBERT outperforms
pure distillation methods in almost all tasks. Moreover, DPHuBERT requires
little training time and performs well with limited training data, making it
suitable for resource-constrained applications. Our method can also be applied
to various speech SSL models. Our code and models will be publicly available.
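As a rough illustration of how distillation and structured pruning can be combined into a single training objective, the sketch below pairs a layer-wise distillation loss (L1 plus cosine distance between matched student and teacher hidden states) with a penalty that drives learnable pruning gates toward a target sparsity. This is a hedged sketch under stated assumptions, not the authors' released implementation; the function name `joint_distill_prune_loss`, the gate parameterization, and the exact loss form are illustrative choices.

```python
import torch
import torch.nn.functional as F


def joint_distill_prune_loss(student_feats, teacher_feats, gates, target_sparsity, lam=1.0):
    """Hypothetical sketch of a joint distillation-plus-pruning objective.

    student_feats / teacher_feats: lists of hidden states of shape (B, T, D) from
    matched student/teacher layers (assumes equal feature dims; a linear projection
    would be needed otherwise).
    gates: learnable structured-pruning gates in [0, 1], one per prunable unit
    (e.g., attention head or feed-forward channel).
    """
    # Distillation term: make student hidden states mimic the teacher's.
    # L1 distance plus negative cosine similarity is a common combination.
    distill = 0.0
    for s, t in zip(student_feats, teacher_feats):
        distill = distill + F.l1_loss(s, t) - F.cosine_similarity(s, t, dim=-1).mean()

    # Pruning term: push the expected fraction of kept units toward the
    # desired density (1 - target_sparsity).
    expected_density = gates.mean()
    sparsity_penalty = (expected_density - (1.0 - target_sparsity)) ** 2

    return distill + lam * sparsity_penalty


# Example usage with random tensors as stand-ins for real features:
s = [torch.randn(2, 50, 768) for _ in range(2)]
t = [torch.randn(2, 50, 768) for _ in range(2)]
g = torch.sigmoid(torch.randn(144))  # e.g., one gate per attention head
loss = joint_distill_prune_loss(s, t, g, target_sparsity=0.25)
```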
Related papers
- SKILL: Similarity-aware Knowledge distILLation for Speech
Self-Supervised Learning [14.480769476843886]
We introduce SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual arbitrarily selected layers within the teacher network.
Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M parameters model class.
arXiv Detail & Related papers (2024-02-26T18:56:42Z)
- Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute [40.6786244647298]
Self-supervised learning (SSL) has led to great strides in speech processing.
However, the resources needed to train these models have become prohibitively large.
In this work, we optimize HuBERT SSL to fit in academic constraints.
arXiv Detail & Related papers (2023-06-11T12:53:46Z)
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition [5.802425107635222]
Miniaturization of SSL models has become an important research direction of practical value.
We explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR).
A discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios.
arXiv Detail & Related papers (2022-10-27T17:21:14Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on the Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Large Language Models Can Be Strong Differentially Private Learners [70.0317718115406]
Differentially Private (DP) learning has seen limited success for building large deep learning models of text.
We show that this performance drop can be mitigated with the use of large pretrained models.
We propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients.
arXiv Detail & Related papers (2021-10-12T01:45:27Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout distillation.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models (a minimal sketch of such example-dependent teacher weighting follows this list).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
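As a minimal sketch of the example-dependent teacher weighting referenced above: the paper itself uses reinforcement learning to select teachers, which is not reproduced here. Assuming per-example teacher weights are already available from some selector, a weighted multi-teacher distillation loss could look like the following; the function name `weighted_multi_teacher_kd` and the temperature value are hypothetical.

```python
import torch
import torch.nn.functional as F


def weighted_multi_teacher_kd(student_logits, teacher_logits_list, example_weights, temp=2.0):
    """Hypothetical sketch: per-example, per-teacher weighted distillation loss.

    student_logits: (B, C); teacher_logits_list: list of (B, C) tensors;
    example_weights: (B, num_teachers), each row summing to 1 (e.g., produced
    by a learned selector).
    """
    log_p_student = F.log_softmax(student_logits / temp, dim=-1)
    loss = student_logits.new_zeros(())
    for k, t_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(t_logits / temp, dim=-1)
        # Per-example KL(teacher || student), weighted by this teacher's weight.
        kl = (p_teacher * (p_teacher.clamp_min(1e-8).log() - log_p_student)).sum(dim=-1)
        loss = loss + (example_weights[:, k] * kl).mean()
    return (temp ** 2) * loss


# Example usage with two teachers:
s = torch.randn(4, 10)
teachers = [torch.randn(4, 10), torch.randn(4, 10)]
w = torch.softmax(torch.randn(4, 2), dim=-1)
loss = weighted_multi_teacher_kd(s, teachers, w)
```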