MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
- URL: http://arxiv.org/abs/2406.05661v2
- Date: Thu, 15 Aug 2024 10:27:45 GMT
- Title: MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations
- Authors: Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
- Abstract summary: MS-HuBERT is an end-to-end self-supervised pre-training method for learning robust speech representations.
It beats vanilla HuBERT on the ASR Librispeech benchmark on average by a 5% margin when evaluated on different finetuning splits.
- Score: 43.479279052047985
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the ASR Librispeech benchmark on average by a 5% margin when evaluated on different finetuning splits. Additionally, we demonstrate that the embeddings learned during pre-training encode essential information for improving the performance of content-based tasks such as ASR.
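To make the objective above concrete, the block below is a minimal sketch, assuming the Multicluster loss amounts to repeating HuBERT's masked cross-entropy over several k-means codebooks; it is not the authors' implementation, the Swap method is not shown, and the shapes, head sizes, and per-codebook averaging are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multicluster_masked_loss(hidden, mask, targets_per_codebook, heads):
    """hidden: (B, T, D) encoder outputs; mask: (B, T) bool, True on masked frames;
    targets_per_codebook: list of (B, T) long tensors of k-means cluster IDs, one per codebook;
    heads: list of nn.Linear(D, K_k) projections, one per codebook (hypothetical design)."""
    losses = []
    for head, targets in zip(heads, targets_per_codebook):
        logits = head(hidden)                              # (B, T, K_k)
        # HuBERT-style: the prediction loss is computed on masked frames only.
        losses.append(F.cross_entropy(logits[mask], targets[mask]))
    # Average the per-codebook losses (summing would be an equally plausible choice).
    return torch.stack(losses).mean()

# Hypothetical wiring: two codebooks with 100 and 500 clusters over 768-dim features.
heads = [nn.Linear(768, 100), nn.Linear(768, 500)]
```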
Related papers
- Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning [2.120033481952703]
Speech-based SSL models face a common dilemma in terms of computational cost.
Fast-HuBERT can be trained in 1.1 days with 8 V100 GPUs on the Librispeech 960h benchmark, without performance degradation.
arXiv Detail & Related papers (2023-09-25T04:07:34Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers; a rough sketch of this idea follows this entry.
Experiments on the LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
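Below is a rough sketch of the intermediate-layer supervision idea referenced above, assuming the same HuBERT-style masked cross-entropy is simply repeated on selected intermediate layers; the layer indices, the shared prediction head, and the averaging are placeholders rather than the ILS-SSL recipe.

```python
import torch
import torch.nn.functional as F

def intermediate_layer_ssl_loss(layer_outputs, mask, targets, head, supervised_layers=(4, 8)):
    """layer_outputs: list of (B, T, D) tensors, one per Transformer layer;
    mask: (B, T) bool over masked frames; targets: (B, T) cluster IDs;
    head: projection D -> K (sharing one head across layers is an assumption)."""
    picked = list(supervised_layers) + [len(layer_outputs) - 1]   # always keep the top layer
    losses = [F.cross_entropy(head(layer_outputs[i])[mask], targets[mask]) for i in picked]
    return torch.stack(losses).mean()
```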
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT [69.26447267827454]
Self-supervised speech representation learning methods like wav2vec 2.0 and Hidden-unit BERT (HuBERT) leverage unlabeled speech data for pre-training.
This paper introduces DistilHuBERT, a novel multi-task learning framework that distills hidden representations directly from a HuBERT model; a rough sketch of such layer-wise distillation follows this entry.
arXiv Detail & Related papers (2021-10-05T09:34:44Z)
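As a hedged illustration of the layer-wise distillation mentioned in the DistilHuBERT entry above, the sketch below has a small student predict selected hidden layers of a frozen HuBERT teacher through separate heads; the layer choice, the L1-plus-cosine matching term, and the head design are assumptions for illustration, not necessarily the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def layerwise_distill_loss(student_hidden, teacher_layers, heads, target_layers=(4, 8, 12)):
    """student_hidden: (B, T, D) output of a small student encoder;
    teacher_layers: list of (B, T, D) hidden states from a frozen HuBERT teacher;
    heads: one projection per distilled layer; target_layers: which teacher layers to match."""
    loss = 0.0
    for head, layer_idx in zip(heads, target_layers):
        pred = head(student_hidden)                    # (B, T, D) prediction for this teacher layer
        target = teacher_layers[layer_idx].detach()    # the teacher is not updated
        l1 = F.l1_loss(pred, target)
        cos = -F.logsigmoid(F.cosine_similarity(pred, target, dim=-1)).mean()
        loss = loss + l1 + cos
    return loss / len(target_layers)
```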
- Private Language Model Adaptation for Speech Recognition [15.726921748859393]
Speech model adaptation is crucial to handle the discrepancy between server-side proxy training data and actual data received on users' local devices.
We introduce an efficient approach for continuously adapting neural network language models (NNLMs) on private devices, with applications to automatic speech recognition.
arXiv Detail & Related papers (2021-09-28T00:15:43Z)
- W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training [49.47516627019855]
w2v-BERT is a framework that combines contrastive learning and masked language modeling for self-supervised speech pre-training; a simplified sketch of this combination follows this entry.
Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models.
arXiv Detail & Related papers (2021-08-07T06:29:36Z)
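To ground the combination described in the w2v-BERT entry above, here is a simplified sketch, assuming a wav2vec 2.0-style contrastive term over masked positions plus a masked-prediction cross-entropy over discrete token IDs; the in-batch negatives, temperature, and gamma weighting are illustrative simplifications, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def contrastive_plus_mlm_loss(context_feats, quantized_targets, mlm_logits,
                              token_targets, mask, gamma=1.0, temperature=0.1):
    """context_feats / quantized_targets: (N, D) positive pairs taken at masked positions;
    mlm_logits: (B, T, K) predictions over discrete token IDs; token_targets: (B, T); mask: (B, T) bool."""
    # Simplified InfoNCE: every other masked position in the batch serves as a negative.
    sim = F.cosine_similarity(context_feats.unsqueeze(1),
                              quantized_targets.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = F.cross_entropy(sim, labels)
    # Masked-prediction (MLM-style) cross-entropy on masked frames only.
    mlm = F.cross_entropy(mlm_logits[mask], token_targets[mask])
    return contrastive + gamma * mlm
```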
This list is automatically generated from the titles and abstracts of the papers in this site.