Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech
Representation Learning
- URL: http://arxiv.org/abs/2309.13860v2
- Date: Fri, 29 Sep 2023 06:48:11 GMT
- Title: Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech
Representation Learning
- Authors: Guanrou Yang, Ziyang Ma, Zhisheng Zheng, Yakun Song, Zhikang Niu, Xie
Chen
- Abstract summary: Speech-based SSL models face a common dilemma in terms of computational cost.
Fast-HuBERT can be trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without performance degradation.
- Score: 2.120033481952703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed significant advancements in self-supervised
learning (SSL) methods for speech-processing tasks. Various speech-based SSL
models have been developed and present promising performance on a range of
downstream tasks including speech recognition. However, existing speech-based
SSL models face a common dilemma in terms of computational cost, which might
hinder their potential application and in-depth academic research. To address
this issue, we first analyze the computational cost of different modules during
HuBERT pre-training and then introduce a stack of efficiency optimizations,
collectively named Fast-HuBERT in this paper. The proposed Fast-HuBERT can be
trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark without
performance degradation, a 5.2x speedup over the original implementation.
Moreover, we explore two well-studied techniques in Fast-HuBERT and demonstrate
consistent improvements, in line with previous work.
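The abstract notes that the authors first profile where HuBERT pre-training spends its compute before applying optimizations. As a rough illustration of that kind of per-module cost analysis (not the paper's code, and using torchaudio's HUBERT_BASE checkpoint purely as a stand-in model), one might time each top-level submodule with forward hooks:

```python
# A minimal per-module timing sketch (not the paper's tooling): attach forward
# pre/post hooks to each top-level submodule of a HuBERT-style model and report
# where a forward pass spends its time (CNN feature extractor vs. Transformer
# encoder). torchaudio's HUBERT_BASE checkpoint is assumed purely as a stand-in.
import time
from collections import defaultdict

import torch
import torchaudio

model = torchaudio.pipelines.HUBERT_BASE.get_model().eval()
starts, totals = {}, defaultdict(float)

def attach_timers(name, module):
    def pre_hook(mod, inputs):
        starts[name] = time.perf_counter()
    def post_hook(mod, inputs, output):
        totals[name] += time.perf_counter() - starts[name]
    module.register_forward_pre_hook(pre_hook)
    module.register_forward_hook(post_hook)

for name, child in model.named_children():
    attach_timers(name, child)

waveform = torch.randn(1, 16000 * 10)  # 10 s of dummy 16 kHz audio
with torch.inference_mode():
    for _ in range(5):
        model(waveform)

for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {seconds:6.3f} s")
```

Swapping the dummy waveform for real pre-training batches and timing the backward pass as well would give a closer picture of where training compute actually goes.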
Related papers
- Speech Representation Learning Revisited: The Necessity of Separate Learnable Parameters and Robust Data Augmentation [43.479279052047985]
We conduct a preliminary study to understand the importance of modeling other information using separate learnable parameters.
Our findings are twofold: first, the O-HuBERT method is able to utilize all layers to build complex features to encode other information; second, a robust data augmentation strategy is essential for learning the information required by tasks that depend on other information.
arXiv Detail & Related papers (2024-08-20T05:45:04Z)
- Exploring Self-Supervised Multi-view Contrastive Learning for Speech Emotion Recognition with Limited Annotations [1.6008229267455227]
We propose a multi-view SSL pre-training technique that can be applied to various representations of speech, including the ones generated by large speech models.
Our experiments, based on wav2vec 2.0, spectral and paralinguistic features, demonstrate that the proposed framework boosts SER performance by up to 10% in Unweighted Average Recall.
arXiv Detail & Related papers (2024-06-12T06:06:55Z)
- MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations [43.479279052047985]
MS-HuBERT is an end-to-end self-supervised pre-training method for learning robust speech representations.
It beats vanilla HuBERT on the ASR Librispeech benchmark on average by a 5% margin when evaluated on different finetuning splits.
arXiv Detail & Related papers (2024-06-09T06:30:28Z)
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR).
We show that a random projection quantizer can achieve downstream performance similar to wav2vec 2.0 while decreasing training time by over a factor of two (a minimal sketch of such a quantizer appears after this list).
arXiv Detail & Related papers (2024-05-07T13:11:37Z)
- A Comparative Study of Pre-trained Speech and Audio Embeddings for Speech Emotion Recognition [0.0]
Speech Emotion Recognition (SER) has a wide range of applications, including dynamic analysis of customer calls, mental health assessment, and personalized language learning.
Pre-trained models (PTMs) have shown great promise in the speech and audio domain. Embeddings leveraged from these models serve as inputs for learning algorithms with applications in various downstream tasks.
We perform an extensive empirical analysis with four speech emotion datasets (CREMA-D, TESS, SAVEE, Emo-DB) by training three algorithms on the derived embeddings.
The results of our study indicate that the best performance is achieved by algorithms trained on embeddings
arXiv Detail & Related papers (2023-04-22T19:56:35Z)
- Improving Pre-trained Language Model Fine-tuning with Noise Stability Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework, named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model (see the sketch after this list).
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers (see the sketch after this list).
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training [72.004873454347]
Two methods are introduced for enhancing the unsupervised speaker information extraction.
Experiment results on SUPERB benchmark show that the proposed system achieves state-of-the-art performance.
We scale up the training dataset to 94 thousand hours of public audio data and achieve further performance improvements.
arXiv Detail & Related papers (2021-10-12T05:43:30Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework to distill hidden representations from a HuBERT model directly.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
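For the BEST-RQ entry above, a rough sketch of a random-projection quantizer. The feature, projection, and codebook dimensions are illustrative assumptions, and this is not the open implementation's code.

```python
# Rough sketch of a random-projection quantizer in the spirit of BEST-RQ (not the
# paper's open implementation): a frozen random matrix projects input features and a
# frozen random codebook assigns each frame the index of its nearest (l2-normalized)
# code vector. The indices act as discrete targets for masked prediction; nothing
# here is trained. Dimensions below are illustrative assumptions.
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, proj_dim, generator=g)           # frozen projection
        self.codebook = torch.randn(codebook_size, proj_dim, generator=g)  # frozen codes

    def __call__(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim), e.g. normalized log-mel frames
        z = F.normalize(feats @ self.proj, dim=-1)                # (batch, time, proj_dim)
        codes = F.normalize(self.codebook, dim=-1)
        dists = torch.cdist(z, codes.expand(z.size(0), -1, -1))  # (batch, time, codebook)
        return dists.argmin(dim=-1)                               # (batch, time) target ids

quantizer = RandomProjectionQuantizer()
targets = quantizer(torch.randn(2, 100, 80))  # dummy batch: 2 utterances x 100 frames
```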
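For the noise stability regularization (LNSR) entry, a rough sketch of a layerwise noise-stability penalty, assuming generic tensor-to-tensor encoder layers and an illustrative noise scale; it is not the authors' implementation.

```python
# Rough sketch of a layerwise noise-stability penalty in the spirit of LNSR (not the
# authors' code): perturb a hidden state with standard Gaussian noise and penalize how
# far the layers above drift from their clean outputs. `sigma` and the generic
# tensor-to-tensor layers are illustrative assumptions.
import torch
import torch.nn as nn

def noise_stability_penalty(layers: nn.ModuleList, hidden: torch.Tensor,
                            sigma: float = 0.1) -> torch.Tensor:
    clean = hidden
    noisy = hidden + sigma * torch.randn_like(hidden)
    penalty = hidden.new_zeros(())
    for layer in layers:                      # run clean and noisy copies through the stack
        clean, noisy = layer(clean), layer(noisy)
        penalty = penalty + (clean - noisy).pow(2).mean()
    return penalty

# Usage during fine-tuning (lambda_reg is a tunable weight, assumed here):
#   total_loss = task_loss + lambda_reg * noise_stability_penalty(encoder.layers, h)
```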
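For the ILS-SSL entry, a minimal sketch of adding an SSL loss on intermediate layers in addition to the top layer; the supervised layer indices, projection heads, and pseudo-label targets are assumptions rather than the authors' configuration.

```python
# Minimal sketch of an intermediate-layer SSL loss (illustrative, not ILS-SSL's code):
# a HuBERT-style masked-prediction loss is applied not only to the final Transformer
# layer but also to selected intermediate layers, each with its own projection head.
# The layer indices, head sizes and pseudo-label targets are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntermediateLayerLoss(nn.Module):
    def __init__(self, hidden_dim=768, num_targets=500, layers=(4, 8, 12)):
        super().__init__()
        self.layers = layers                                   # 1-indexed layers to supervise
        self.heads = nn.ModuleDict(
            {str(l): nn.Linear(hidden_dim, num_targets) for l in layers}
        )

    def forward(self, hidden_states, targets, mask):
        # hidden_states: list of (batch, time, hidden_dim), one entry per layer
        # targets: (batch, time) pseudo-label ids; mask: (batch, time) bool, masked frames
        loss = hidden_states[0].new_zeros(())
        for l in self.layers:
            logits = self.heads[str(l)](hidden_states[l - 1])
            loss = loss + F.cross_entropy(logits[mask], targets[mask])
        return loss / len(self.layers)
```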
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.