MelHuBERT: A simplified HuBERT on Mel spectrograms
- URL: http://arxiv.org/abs/2211.09944v3
- Date: Thu, 29 Aug 2024 19:25:59 GMT
- Title: MelHuBERT: A simplified HuBERT on Mel spectrograms
- Authors: Tzu-Quan Lin, Hung-yi Lee, Hao Tang
- Abstract summary: We revisit the training of HuBERT, a highly successful self-supervised model.
We improve and simplify several key components, including the loss function, input representation, and training in multiple stages.
Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition.
- Score: 55.608981341747246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, the input representation, and training in multiple stages. Our model, MelHuBERT, achieves favorable performance against HuBERT on phone recognition, speaker identification, and automatic speech recognition, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per one second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
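To make the recipe above concrete, here is a minimal PyTorch/torchaudio sketch of masked prediction over Mel spectrogram frames: log-Mel features go into a Transformer encoder, and a plain frame-level cross-entropy loss predicts cluster labels at masked positions. It is only an illustrative reading of the abstract; the frontend settings, masking scheme, model dimensions, cluster count, and the `MaskedPredictionModel` / `pretrain_step` names are assumptions, not the released MelHuBERT configuration (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torchaudio

# Illustrative sketch of the recipe described above: Mel spectrogram frames
# as input, a Transformer encoder, and a frame-level cross-entropy loss that
# predicts cluster labels at masked positions. Dimensions, masking scheme,
# and cluster count are assumptions, not MelHuBERT's exact setup.

N_MELS, N_CLUSTERS, D_MODEL = 80, 100, 768

mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=200, n_mels=N_MELS
)

class MaskedPredictionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(N_MELS, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.mask_emb = nn.Parameter(torch.zeros(D_MODEL))
        self.head = nn.Linear(D_MODEL, N_CLUSTERS)

    def forward(self, mel, mask):
        # mel: (batch, frames, n_mels); mask: (batch, frames), True = masked
        x = self.proj_in(mel)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(x))  # (batch, frames, n_clusters)

def pretrain_step(model, optimizer, wav, labels):
    """One pre-training step. `wav` is (batch, samples); `labels` holds
    frame-level cluster ids (long), assumed to be precomputed offline."""
    mel = mel_frontend(wav).clamp(min=1e-10).log().transpose(1, 2)
    frames = min(mel.size(1), labels.size(1))
    mel, labels = mel[:, :frames], labels[:, :frames]
    mask = torch.rand(mel.shape[:2], device=mel.device) < 0.15  # illustrative random masking
    logits = model(mel, mask)
    loss = nn.functional.cross_entropy(logits[mask], labels[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With HuBERT-style targets (e.g. k-means cluster ids computed from acoustic features), `pretrain_step` can be called inside an ordinary data-loader loop; the paper's actual loss simplifications, staging, and hyperparameters are documented in the repository above.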
Related papers
- MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations [43.479279052047985]
MS-HuBERT is an end-to-end self-supervised pre-training method for learning robust speech representations.
It beats vanilla HuBERT on the LibriSpeech ASR benchmark by an average margin of 5% across different fine-tuning splits.
arXiv Detail & Related papers (2024-06-09T06:30:28Z)
- OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification [44.94458898538114]
We propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).
It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID).
Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference.
arXiv Detail & Related papers (2024-02-20T02:04:38Z)
- Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute [40.6786244647298]
Self-supervised learning (SSL) has led to great strides in speech processing.
However, the resources needed to train these models have become prohibitively large.
In this work, we optimize HuBERT SSL to fit within academic compute constraints.
arXiv Detail & Related papers (2023-06-11T12:53:46Z)
- DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models [34.464301065191336]
Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment.
We propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning.
arXiv Detail & Related papers (2023-05-28T07:09:33Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss (an illustrative sketch of this step appears after this list).
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those huge models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
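The HuBERT entry above mentions an offline clustering step that provides aligned target labels for the masked prediction loss. The sketch below illustrates one common way such frame-level pseudo-labels can be produced, using k-means over MFCC frames with torchaudio and scikit-learn; the feature choice, cluster count, and the `fit_kmeans` / `pseudo_labels` helpers are assumptions for illustration, not the exact pipeline of any paper listed here.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

# Illustrative offline clustering step: MFCC frames from unlabeled audio are
# clustered with k-means, and the resulting cluster ids act as frame-aligned
# pseudo-labels for a masked prediction loss (applied over masked frames only,
# as in the training-step sketch near the top of this page).

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=39,
    melkwargs={"n_fft": 400, "hop_length": 200, "n_mels": 80},
)

def fit_kmeans(waveforms, n_clusters=100):
    """Fit k-means on MFCC frames pooled from a list of 1-D waveforms."""
    feats = [mfcc(w).transpose(0, 1) for w in waveforms]  # each (frames, n_mfcc)
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    km.fit(torch.cat(feats).numpy())
    return km

def pseudo_labels(km, waveform):
    """Frame-aligned cluster ids for one utterance, usable as prediction targets."""
    frames = mfcc(waveform).transpose(0, 1).numpy()   # (frames, n_mfcc)
    return torch.from_numpy(km.predict(frames)).long()  # (frames,)
```

In the actual HuBERT recipe, later training iterations refine these targets by re-clustering hidden representations of an earlier model; the sketch covers only a first, MFCC-based round, and the resulting labels plug directly into a masked prediction step like the one sketched near the top of this page.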