DistilHuBERT: Speech Representation Learning by Layer-wise Distillation
of Hidden-unit BERT
- URL: http://arxiv.org/abs/2110.01900v2
- Date: Wed, 6 Oct 2021 15:51:03 GMT
- Title: DistilHuBERT: Speech Representation Learning by Layer-wise Distillation
of Hidden-unit BERT
- Authors: Heng-Jui Chang, Shu-wen Yang, Hung-yi Lee
- Abstract summary: Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
- Score: 69.26447267827454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised speech representation learning methods like wav2vec 2.0 and
Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training and
offer good representations for numerous speech processing tasks. Despite the
success of these methods, they require large memory and high pre-training
costs, making them inaccessible for researchers in academia and small
companies. Therefore, this paper introduces DistilHuBERT, a novel multi-task
learning framework to distill hidden representations from a HuBERT model
directly. This method reduces HuBERT's size by 75% and makes it 73% faster while
retaining most of its performance across ten different tasks. Moreover,
DistilHuBERT requires little training time and data, opening up the possibility
of pre-training personal and on-device SSL models for speech.
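To make the multi-task, layer-wise distillation idea concrete, here is a minimal PyTorch sketch in which a small student encoder feeds several prediction heads, each regressing a different hidden layer of a frozen HuBERT teacher with a combined L1 and negative-cosine objective. The distilled layer indices (4, 8, 12), the head design, and the loss weighting are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of multi-task, layer-wise distillation from a frozen teacher;
# layer choices, head design, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

class LayerwiseDistiller(nn.Module):
    def __init__(self, student_encoder, dim=768, target_layers=(4, 8, 12)):
        super().__init__()
        self.student = student_encoder            # small Transformer/CNN stack
        self.target_layers = target_layers
        # One prediction head per distilled teacher layer.
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in target_layers]
        )

    def forward(self, features, teacher_hiddens, cos_weight=1.0):
        """features: (batch, time, dim); teacher_hiddens: list of per-layer
        teacher outputs, each (batch, time, dim), from the frozen teacher."""
        shared = self.student(features)           # shared student representation
        loss = 0.0
        for head, layer_idx in zip(self.heads, self.target_layers):
            pred = head(shared)
            target = teacher_hiddens[layer_idx]
            l1 = nn.functional.l1_loss(pred, target)
            cos = nn.functional.cosine_similarity(pred, target, dim=-1).mean()
            loss = loss + l1 - cos_weight * cos   # maximize cosine similarity
        return loss / len(self.target_layers)
```

After distillation the prediction heads would typically be discarded, leaving the shared student encoder as the compact representation model.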
Related papers
- GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model [20.620589404103644]
This paper introduces GenDistiller, a novel knowledge distillation framework in which a much smaller student network directly generates the hidden representations of the pre-trained teacher model.
The method takes the previous hidden layer as history and predicts the teacher model layer by layer, autoregressively (see the sketch after this entry).
Experiments reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks.
arXiv Detail & Related papers (2024-06-12T01:25:00Z)
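As a rough illustration of the autoregressive, layer-by-layer prediction described in the GenDistiller entry above, the following PyTorch sketch uses a recurrent predictor whose state carries the previously predicted teacher layer. The GRU predictor, the L1 objective, and all sizes are assumptions for illustration, not the paper's architecture.

```python
# Hypothetical sketch of autoregressive layer-by-layer distillation;
# the recurrent predictor and sizes are assumptions, not GenDistiller's design.
import torch
import torch.nn as nn

class AutoregressiveLayerDistiller(nn.Module):
    """Predict the teacher's hidden layers one by one, conditioning each
    prediction on the previously predicted layer (carried in the GRU state)."""

    def __init__(self, feat_dim=768):
        super().__init__()
        # Tiny stand-in for a compact student speech encoder.
        self.student = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU())
        # Recurrent predictor: its state holds the previously predicted layer.
        self.predictor = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, features, teacher_layers):
        # features: (batch, time, feat_dim); teacher_layers: list of tensors
        # with the same shape, one per teacher layer, ordered bottom to top.
        h = self.student(features)
        b, t, d = h.shape
        inputs = h.reshape(b * t, d)
        state = torch.zeros_like(inputs)
        loss = 0.0
        for target in teacher_layers:
            state = self.predictor(inputs, state)   # next-layer prediction
            loss = loss + nn.functional.l1_loss(state.reshape(b, t, d), target)
        return loss / len(teacher_layers)
```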
- MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations [43.479279052047985]
MS-HuBERT is an end-to-end self-supervised pre-training method for learning robust speech representations.
It beats vanilla HuBERT on the LibriSpeech ASR benchmark by an average margin of 5% across different fine-tuning splits.
arXiv Detail & Related papers (2024-06-09T06:30:28Z)
- Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data [75.7383558074758]
This work presents an Open Whisper-style Speech Model (OWSM), which reproduces Whisper-style training using an open-source toolkit and publicly available data.
We will publicly release all scripts used for data preparation, training, inference, and scoring as well as pre-trained models and training logs to promote open science.
arXiv Detail & Related papers (2023-09-25T05:01:34Z)
- DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models [34.464301065191336]
Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but large model sizes and heavy computational costs hinder deployment.
We propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-05-28T07:09:33Z)
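As a deliberately simplified picture of joint distillation and pruning (the DPHuBERT entry above), the sketch below attaches a learnable gate to each student layer and adds a sparsity penalty to a layer-wise distillation loss. The sigmoid gates and the L1 penalty are stand-ins chosen for brevity; the paper's actual structured-pruning objective may differ.

```python
# Simplified joint distillation + pruning sketch; gate design and penalty
# are illustrative assumptions, not DPHuBERT's exact formulation.
import torch
import torch.nn as nn

class GatedStudentLayer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Learnable gate; pushing it toward 0 effectively prunes the layer.
        self.gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        gate = torch.sigmoid(self.gate_logit)
        return x + gate * self.ffn(x), gate

def joint_loss(student_layers, x, teacher_hiddens, sparsity_weight=0.01):
    """Distill each student layer toward a teacher hidden state while
    penalizing open gates so that unneeded layers can be pruned."""
    distill, sparsity = 0.0, 0.0
    h = x
    for layer, target in zip(student_layers, teacher_hiddens):
        h, gate = layer(h)
        distill = distill + nn.functional.l1_loss(h, target)
        sparsity = sparsity + gate.abs().sum()
    return distill + sparsity_weight * sparsity
```

After training, layers whose gates stay near zero would be removed and the remaining network fine-tuned; this thresholding step is likewise an assumption made for illustration.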
- MelHuBERT: A simplified HuBERT on Mel spectrograms [55.608981341747246]
We revisit the training of HuBERT, a highly successful self-supervised model.
We improve and simplify several key components, including the loss function, input representation, and training in multiple stages.
Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition.
arXiv Detail & Related papers (2022-11-17T23:38:29Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show that the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves performance comparable to the teacher model on most tasks with 29% fewer parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers (see the sketch after this entry).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
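Picking up the sketch promised in the ILS-SSL entry above: a minimal example of applying the same masked-prediction SSL loss at selected intermediate layers in addition to the top layer. The layer choices, the shared projection, and the weighting are assumptions for illustration.

```python
# Minimal sketch of intermediate-layer supervision (ILS-SSL-style);
# layer choices, projection sharing, and loss weighting are assumptions.
import torch
import torch.nn as nn

def intermediate_layer_ssl_loss(hidden_states, mask, targets, projection,
                                supervised_layers=(4, 8, 12), weight=1.0):
    """hidden_states: list of (batch, time, dim) tensors, one per layer.
    mask: (batch, time) bool tensor marking masked frames.
    targets: (batch, time) long tensor of pseudo-label (hidden-unit) ids.
    projection: nn.Linear mapping dim -> num_units, shared across layers."""
    total = 0.0
    for layer_idx in supervised_layers:
        logits = projection(hidden_states[layer_idx][mask])   # masked frames only
        total = total + nn.functional.cross_entropy(logits, targets[mask])
    return weight * total / len(supervised_layers)
```

In practice the same masked-prediction loss would also be applied at the final layer; whether the projection head is shared across layers is an assumption here.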
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)