MelHuBERT: A simplified HuBERT on Mel spectrograms
- URL: http://arxiv.org/abs/2211.09944v3
- Date: Thu, 29 Aug 2024 19:25:59 GMT
- Title: MelHuBERT: A simplified HuBERT on Mel spectrograms
- Authors: Tzu-Quan Lin, Hung-yi Lee, Hao Tang
- Abstract summary: We revisit the training of HuBERT, a highly successful self-supervised model.
We improve and simplify several key components, including the loss function, input representation, and training in multiple stages.
Our model, MelHuBERT, is able to achieve favorable performance on phone recognition, speaker identification, and automatic speech recognition.
- Score: 55.608981341747246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks. However, most self-supervised models require a large amount of compute and multiple GPUs to train, significantly hampering the development of self-supervised learning. In an attempt to reduce the computation of training, we revisit the training of HuBERT, a highly successful self-supervised model. We improve and simplify several key components, including the loss function, the input representation, and training in multiple stages. Our model, MelHuBERT, achieves favorable performance against HuBERT on phone recognition, speaker identification, and automatic speech recognition, while saving 31.2% of the pre-training time, or equivalently 33.5% of the MACs per one second of speech. The code and pre-trained models are available at https://github.com/nervjack2/MelHuBERT.
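To make the recipe above concrete, here is a minimal PyTorch/torchaudio sketch of masked prediction over Mel spectrogram frames: log-Mel features go into a Transformer encoder, and a plain frame-level cross-entropy loss predicts cluster labels at masked positions. It is only an illustrative reading of the abstract; the frontend settings, masking scheme, model dimensions, cluster count, and the `MaskedPredictionModel` / `pretrain_step` names are assumptions, not the released MelHuBERT configuration (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torchaudio

# Illustrative sketch of the recipe described above: Mel spectrogram frames
# as input, a Transformer encoder, and a frame-level cross-entropy loss that
# predicts cluster labels at masked positions. Dimensions, masking scheme,
# and cluster count are assumptions, not MelHuBERT's exact setup.

N_MELS, N_CLUSTERS, D_MODEL = 80, 100, 768

mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=200, n_mels=N_MELS
)

class MaskedPredictionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj_in = nn.Linear(N_MELS, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.mask_emb = nn.Parameter(torch.zeros(D_MODEL))
        self.head = nn.Linear(D_MODEL, N_CLUSTERS)

    def forward(self, mel, mask):
        # mel: (batch, frames, n_mels); mask: (batch, frames), True = masked
        x = self.proj_in(mel)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.encoder(x))  # (batch, frames, n_clusters)

def pretrain_step(model, optimizer, wav, labels):
    """One pre-training step. `wav` is (batch, samples); `labels` holds
    frame-level cluster ids (long), assumed to be precomputed offline."""
    mel = mel_frontend(wav).clamp(min=1e-10).log().transpose(1, 2)
    frames = min(mel.size(1), labels.size(1))
    mel, labels = mel[:, :frames], labels[:, :frames]
    mask = torch.rand(mel.shape[:2], device=mel.device) < 0.15  # illustrative random masking
    logits = model(mel, mask)
    loss = nn.functional.cross_entropy(logits[mask], labels[mask])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With HuBERT-style targets (e.g. k-means cluster ids computed from acoustic features), `pretrain_step` can be called inside an ordinary data-loader loop; the paper's actual loss simplifications, staging, and hyperparameters are documented in the repository above.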
Related papers
- MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations [43.479279052047985]
MS-HuBERT is an end-to-end self-supervised pre-training method for learning robust speech representations.
It beats vanilla HuBERT on the LibriSpeech ASR benchmark by an average margin of 5% across different fine-tuning splits.
arXiv Detail & Related papers (2024-06-09T06:30:28Z)
- OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification [44.94458898538114]
We propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC).
It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID).
Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference.
arXiv Detail & Related papers (2024-02-20T02:04:38Z)
- Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute [40.6786244647298]
Self-supervised learning (SSL) has led to great strides in speech processing.
However, the resources needed to train these models have become prohibitively large.
In this work, we optimize HuBERT SSL to fit within academic compute constraints.
arXiv Detail & Related papers (2023-06-11T12:53:46Z)
- DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models [34.464301065191336]
Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment.
We propose DPHuBERT, a novel task-agnostic compression method for speech SSL based on joint distillation and pruning.
arXiv Detail & Related papers (2023-05-28T07:09:33Z)
- LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT [69.77358429702873]
We propose LightHuBERT, a once-for-all Transformer compression framework, to find the desired architectures automatically.
Experiments on automatic speech recognition (ASR) and the SUPERB benchmark show the proposed LightHuBERT enables over $10^9$ architectures.
LightHuBERT achieves comparable performance to the teacher model in most tasks with a 29% reduction in parameters.
arXiv Detail & Related papers (2022-03-29T14:20:55Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM builds on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
- HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units [81.53783563025084]
We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss (an illustrative sketch of this step appears after this list).
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
arXiv Detail & Related papers (2021-06-14T14:14:28Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those huge models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
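The HuBERT entry above mentions an offline clustering step that provides aligned target labels for the masked prediction loss. The sketch below illustrates one common way such frame-level pseudo-labels can be produced, using k-means over MFCC frames with torchaudio and scikit-learn; the feature choice, cluster count, and the `fit_kmeans` / `pseudo_labels` helpers are assumptions for illustration, not the exact pipeline of any paper listed here.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

# Illustrative offline clustering step: MFCC frames from unlabeled audio are
# clustered with k-means, and the resulting cluster ids act as frame-aligned
# pseudo-labels for a masked prediction loss (applied over masked frames only,
# as in the training-step sketch near the top of this page).

mfcc = torchaudio.transforms.MFCC(
    sample_rate=16000, n_mfcc=39,
    melkwargs={"n_fft": 400, "hop_length": 200, "n_mels": 80},
)

def fit_kmeans(waveforms, n_clusters=100):
    """Fit k-means on MFCC frames pooled from a list of 1-D waveforms."""
    feats = [mfcc(w).transpose(0, 1) for w in waveforms]  # each (frames, n_mfcc)
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
    km.fit(torch.cat(feats).numpy())
    return km

def pseudo_labels(km, waveform):
    """Frame-aligned cluster ids for one utterance, usable as prediction targets."""
    frames = mfcc(waveform).transpose(0, 1).numpy()   # (frames, n_mfcc)
    return torch.from_numpy(km.predict(frames)).long()  # (frames,)
```

In the actual HuBERT recipe, later training iterations refine these targets by re-clustering hidden representations of an earlier model; the sketch covers only a first, MFCC-based round, and the resulting labels plug directly into a masked prediction step like the one sketched near the top of this page.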