FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech
Self-Supervised Learning
- URL: http://arxiv.org/abs/2207.00555v1
- Date: Fri, 1 Jul 2022 17:11:23 GMT
- Title: FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech
Self-Supervised Learning
- Authors: Yeonghyeon Lee, Kangwook Jang, Jahyun Goo, Youngmoon Jung, Hoirin Kim
- Abstract summary: We propose FitHuBERT, which is thinner in dimension throughout almost all model components and deeper in layers than prior speech SSL distillation works.
Our method reduces the model to 23.8% in size and 35.9% in inference time compared to HuBERT.
Also, we achieve a 12.1% word error rate and a 13.3% phoneme error rate on the SUPERB benchmark, which is superior to prior work.
- Score: 12.561034842067887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale speech self-supervised learning (SSL) has emerged as a
central field of speech processing; however, the computational cost arising
from its vast model size creates a high entry barrier for academia. In
addition, existing distillation techniques for speech SSL models compress the
model by reducing layers, which induces performance degradation in linguistic
pattern recognition tasks such as phoneme recognition (PR). In this paper, we
propose FitHuBERT, which is thinner in dimension throughout almost all model
components and deeper in layers than prior speech SSL distillation works.
Moreover, we employ a time-reduction layer to speed up inference and propose a
hint-based distillation method to reduce performance degradation. Our method
shrinks the model to 23.8% of HuBERT's size and 35.9% of its inference time.
Also, we achieve a 12.1% word error rate and a 13.3% phoneme error rate on the
SUPERB benchmark, which is superior to prior work.
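The abstract names two concrete mechanisms: a time-reduction layer that shortens the frame sequence before the Transformer, and hint-based distillation that matches student hidden layers to teacher hidden layers. The PyTorch sketch below illustrates both under assumptions that are not taken from the paper: the strided Conv1d time-reduction layer, the 480-dim/12-layer student shape, the choice of hint layers, and the plain MSE hint loss are placeholders, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeReduction(nn.Module):
    """Shorten the frame sequence with a strided 1-D convolution (assumed design)."""
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=stride, stride=stride)

    def forward(self, x):                      # x: (batch, time, dim)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)  # (batch, time//stride, dim)

class ThinDeepStudent(nn.Module):
    """Thinner (narrow width) but deeper (many layers) Transformer student."""
    def __init__(self, dim=480, layers=12, teacher_dim=768, hint_layers=(4, 8, 12)):
        super().__init__()
        self.time_reduction = TimeReduction(dim)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                       dim_feedforward=2 * dim, batch_first=True)
            for _ in range(layers)
        ])
        # Linear "hint" projections map the thin student states up to teacher width.
        self.hint_proj = nn.ModuleDict({str(l): nn.Linear(dim, teacher_dim)
                                        for l in hint_layers})

    def forward(self, feats):                  # feats: (batch, time, dim) CNN features
        x = self.time_reduction(feats)
        hints = {}
        for i, block in enumerate(self.blocks, start=1):
            x = block(x)
            if str(i) in self.hint_proj:
                hints[i] = self.hint_proj[str(i)](x)
        return x, hints

def hint_distillation_loss(hints, teacher_states):
    """MSE between projected student hints and the matched teacher layers.
    teacher_states[l]: (batch, time', teacher_dim), assumed already pooled to the
    student's reduced frame rate so the shapes line up."""
    return sum(F.mse_loss(h, teacher_states[l]) for l, h in hints.items())

# Dummy usage: 2 utterances, 200 frames of 480-dim convolutional features.
student = ThinDeepStudent()
out, hints = student(torch.randn(2, 200, 480))
teacher = {l: torch.randn(2, 100, 768) for l in (4, 8, 12)}
loss = hint_distillation_loss(hints, teacher)
```

Because the time-reduction layer halves the student's frame rate, the teacher states must be pooled to the same length before the hint loss is computed; that alignment step is assumed rather than shown.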
Related papers
- STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models [10.07318014676215]
We propose to compress speech SSL models by distilling speech temporal relation (STaR); a rough sketch of the idea appears after this list.
Our model distilled from HuBERT BASE achieves an overall score of 79.8 on the SUPERB benchmark, the best performance among models with up to 27 million parameters.
arXiv Detail & Related papers (2023-12-14T15:37:37Z)
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning [2.120033481952703]
Speech-based SSL models face a common dilemma in terms of computational cost.
Fast-HuBERT can be trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without performance degradation.
arXiv Detail & Related papers (2023-09-25T04:07:34Z)
- MelHuBERT: A simplified HuBERT on Mel spectrograms [55.608981341747246]
We revisit the training of HuBERT, a highly successful self-supervised model.
We improve and simplify several key components, including the loss function, input representation, and training in multiple stages.
Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition.
arXiv Detail & Related papers (2022-11-17T23:38:29Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST by an average accuracy of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition [5.802425107635222]
Miniaturization of SSL models has become an important research direction of practical value.
We explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR).
A discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios.
arXiv Detail & Related papers (2022-10-27T17:21:14Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show that the proposed LightHuBERT enables over $10^9$ architectures; a toy sketch of the once-for-all idea appears after this list.
LightHuBERT achieves performance comparable to the teacher model on most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers; a minimal sketch appears after this list.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly; a rough sketch appears after this list.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
A key ingredient of our approach is applying the prediction loss over the masked regions only; a schematic sketch appears after this list.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models [15.237992590162593]
We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AMs).
We discuss SSL for AMs in a small-footprint setting, showing that a smaller-capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by a 14.3% word error rate reduction (WERR).
We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting.
arXiv Detail & Related papers (2021-06-11T02:23:40Z)
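For the STaR entry above (forward-referenced there), the summary only says that a "speech temporal relation" is distilled. One plausible reading, offered purely as an assumption and not as the paper's actual objective, is to match frame-to-frame similarity structure between teacher and student rather than raw features, which also removes any need for the two models to share a hidden width:

```python
import torch
import torch.nn.functional as F

def temporal_relation(x):
    """Pairwise frame similarity (a time x time Gram matrix) for one utterance.
    x: (time, dim) hidden states; the result is width-agnostic, so teacher and
    student dimensions do not need to match."""
    x = F.normalize(x, dim=-1)
    return x @ x.T                             # (time, time) cosine similarities

def star_like_loss(student_h, teacher_h):
    """Match temporal relations instead of raw features (hypothetical stand-in
    for the STaR objective, not the paper's exact formulation)."""
    return F.mse_loss(temporal_relation(student_h), temporal_relation(teacher_h))

# Dummy usage: 200 frames, a 256-dim student against a 768-dim teacher.
loss = star_like_loss(torch.randn(200, 256), torch.randn(200, 768))
```

Relation-based targets of this kind are a standard option when the student is much thinner than the teacher, since no projection layer is needed.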
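For LightHuBERT, the "once-for-all" phrasing suggests a weight-sharing supernet from which many subnetworks (the summary's $10^9$ architectures) can be sliced after a single training run. The toy sketch below shows only that slicing idea; the search-space values and the SlimmableLinear helper are made-up illustrations, not LightHuBERT's actual code.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy search space; the real LightHuBERT space is far larger.
SEARCH_SPACE = {
    "embed_dim": [256, 384, 512, 640, 768],
    "num_layers": [6, 8, 10, 12],
    "num_heads": [4, 6, 8, 12],
}

def sample_subnet():
    """Randomly draw one architecture from the weight-sharing supernet's space."""
    return {key: random.choice(values) for key, values in SEARCH_SPACE.items()}

class SlimmableLinear(nn.Linear):
    """Linear layer whose active width can shrink per forward pass by slicing the
    full supernet weight matrix, a common once-for-all weight-sharing trick."""
    def forward(self, x, out_dim=None):
        out_dim = out_dim or self.out_features
        in_dim = x.size(-1)
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])

# Each training step samples a subnet; distillation against the teacher would
# then be applied to that subnet's outputs (omitted here).
config = sample_subnet()
layer = SlimmableLinear(768, 768)
x = torch.randn(2, 100, config["embed_dim"])
y = layer(x, out_dim=config["embed_dim"])      # (2, 100, embed_dim)
```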
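For ILS-SSL, the stated mechanism is an additional SSL loss on intermediate layers. A minimal sketch follows, assuming the intermediate loss is simply the same masked-prediction objective applied to a few inner layers and summed with the final-layer loss; the layer indices, weighting, and per-layer projection heads are placeholders.

```python
import torch
import torch.nn.functional as F

def ils_ssl_loss(layer_outputs, proj_heads, targets, mask,
                 intermediate_layers=(4, 8), alpha=1.0):
    """Final-layer masked-prediction loss plus the same loss on chosen inner layers.
    layer_outputs: list of (batch, time, dim) states, one per Transformer layer.
    proj_heads: dict layer_index -> nn.Linear(dim, num_clusters) (hypothetical).
    targets: (batch, time) pseudo-label ids; mask: (batch, time) bool, True = masked."""
    def masked_ce(states, head):
        logits = head(states)[mask]            # keep only the masked frames
        return F.cross_entropy(logits, targets[mask])

    last = len(layer_outputs)
    loss = masked_ce(layer_outputs[-1], proj_heads[last])
    for l in intermediate_layers:
        loss = loss + alpha * masked_ce(layer_outputs[l - 1], proj_heads[l])
    return loss

# Dummy usage: 12 layers, 2 utterances, 50 frames, 256-dim states, 100 clusters.
layers = [torch.randn(2, 50, 256) for _ in range(12)]
heads = {l: torch.nn.Linear(256, 100) for l in (4, 8, 12)}
targets = torch.randint(0, 100, (2, 50))
mask = torch.rand(2, 50) > 0.5
loss = ils_ssl_loss(layers, heads, targets, mask)
```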
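For DistilHuBERT, the summary describes a multi-task framework that distills hidden representations directly. Below is a rough sketch under the common reading that a shallow student carries several prediction heads, each regressing a different teacher layer; the specific layers (4, 8, 12) and the L1-plus-cosine form of the loss are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskDistillHeads(nn.Module):
    """Separate prediction heads on a shallow student, one per distilled teacher layer."""
    def __init__(self, student_dim=768, teacher_dim=768, teacher_layers=(4, 8, 12)):
        super().__init__()
        self.heads = nn.ModuleDict({str(l): nn.Linear(student_dim, teacher_dim)
                                    for l in teacher_layers})

    def forward(self, student_out):            # (batch, time, student_dim)
        return {int(l): head(student_out) for l, head in self.heads.items()}

def layerwise_distill_loss(preds, teacher_states, lam=1.0):
    """L1 regression plus a cosine-similarity term per predicted teacher layer
    (assumed form of the objective; the weighting lam is a placeholder)."""
    loss = 0.0
    for l, pred in preds.items():
        target = teacher_states[l]
        loss = loss + F.l1_loss(pred, target) \
                    - lam * F.cosine_similarity(pred, target, dim=-1).mean()
    return loss

# Dummy usage: student output and teacher states for layers 4, 8 and 12.
heads = MultiTaskDistillHeads()
preds = heads(torch.randn(2, 100, 768))
teacher = {l: torch.randn(2, 100, 768) for l in (4, 8, 12)}
loss = layerwise_distill_loss(preds, teacher)
```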
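For HuBERT, the two stated ingredients are an offline clustering step that supplies aligned pseudo-labels and a prediction loss computed over masked regions only. The schematic below shows that masked-regions-only step with a plain linear classifier standing in for HuBERT's actual codeword-similarity output layer; it assumes the k-means cluster ids are already computed per frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hubert_masked_prediction_loss(encoder_out, proj, cluster_ids, mask):
    """Cross-entropy against offline cluster ids, computed on masked frames only.
    encoder_out: (batch, time, dim) Transformer outputs over the masked input.
    proj: a stand-in nn.Linear(dim, num_clusters) classifier (the real model
    scores codeword embeddings instead). cluster_ids: (batch, time) k-means labels.
    mask: (batch, time) bool, True where the input frames were masked."""
    logits = proj(encoder_out)
    return F.cross_entropy(logits[mask], cluster_ids[mask])

# Dummy usage: 2 utterances, 100 frames, 768-dim outputs, 500 clusters.
proj = nn.Linear(768, 500)
out = torch.randn(2, 100, 768)
ids = torch.randint(0, 500, (2, 100))
mask = torch.rand(2, 100) > 0.5
loss = hubert_masked_prediction_loss(out, proj, ids, mask)
```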