HuBERT: Self-Supervised Speech Representation Learning by Masked
Prediction of Hidden Units
- URL: http://arxiv.org/abs/2106.07447v1
- Date: Mon, 14 Jun 2021 14:14:28 GMT
- Title: HuBERT: Self-Supervised Speech Representation Learning by Masked
Prediction of Hidden Units
- Authors: Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia,
Ruslan Salakhutdinov, Abdelrahman Mohamed
- Abstract summary: We propose an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
A key ingredient of our approach is applying the prediction loss over the masked regions only.
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
- Score: 81.53783563025084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised approaches for speech representation learning are challenged
by three unique problems: (1) there are multiple sound units in each input
utterance, (2) there is no lexicon of input sound units during the pre-training
phase, and (3) sound units have variable lengths with no explicit segmentation.
To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT)
approach for self-supervised speech representation learning, which utilizes an
offline clustering step to provide aligned target labels for a BERT-like
prediction loss. A key ingredient of our approach is applying the prediction
loss over the masked regions only, which forces the model to learn a combined
acoustic and language model over the continuous inputs. HuBERT relies primarily
on the consistency of the unsupervised clustering step rather than the
intrinsic quality of the assigned cluster labels. Starting with a simple
k-means teacher of 100 clusters, and using two iterations of clustering, the
HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0
performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with
10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model,
HuBERT shows up to 19% and 13% relative WER reduction on the more challenging
dev-other and test-other evaluation subsets.
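
Below is a minimal PyTorch sketch of the pre-training idea described in the abstract: an offline k-means "teacher" assigns a cluster label to every frame, some frames are masked, and the model is trained with a cross-entropy loss computed over the masked frames only. Module names, dimensions, and the generic Transformer encoder here are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of HuBERT-style masked prediction (illustrative, not the
# official implementation). An offline k-means "teacher" labels every frame;
# the model predicts those labels, with the loss applied to masked frames only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HubertLikeModel(nn.Module):
    def __init__(self, feat_dim=39, hidden=256, num_clusters=100, num_layers=4):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden)
        self.mask_emb = nn.Parameter(torch.zeros(hidden))    # learned mask embedding
        layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj_out = nn.Linear(hidden, num_clusters)      # predict cluster ids

    def forward(self, feats, mask):
        # feats: (batch, frames, feat_dim); mask: (batch, frames) boolean
        x = self.proj_in(feats)
        x = torch.where(mask.unsqueeze(-1), self.mask_emb, x)  # blank out masked frames
        x = self.encoder(x)
        return self.proj_out(x)                               # (batch, frames, num_clusters)

def masked_prediction_loss(logits, targets, mask):
    # Cross-entropy over the masked positions only, as emphasized in the abstract.
    return F.cross_entropy(logits[mask], targets[mask])

# Usage sketch: targets come from k-means run offline, e.g. on MFCC features in
# the first iteration and on an earlier model's hidden states in later ones.
model = HubertLikeModel()
feats = torch.randn(2, 200, 39)               # dummy 39-dim MFCC frames
targets = torch.randint(0, 100, (2, 200))     # offline k-means cluster labels
mask = torch.rand(2, 200) < 0.08              # random frame mask; the paper masks spans
loss = masked_prediction_loss(model(feats, mask), targets, mask)
```

Because the model is trained against the discrete cluster ids rather than the continuous features, the consistency of the k-means teacher matters more than the intrinsic quality of its labels, which is the point the abstract stresses.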
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5 Hz and 60 bps, and achieves SotA syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z) - Detecting Speech Abnormalities with a Perceiver-based Sequence
Classifier that Leverages a Universal Speech Model [4.503292461488901]
We propose a Perceiver-based sequence classifier to detect abnormalities in speech reflective of several neurological disorders.
We combine this classifier with a Universal Speech Model (USM) that is trained (unsupervised) on 12 million hours of diverse audio recordings.
Our model outperforms standard transformer (80.9%) and perceiver (81.8%) models and achieves an average accuracy of 83.1%.
arXiv Detail & Related papers (2023-10-16T21:07:12Z) - Representation Learning With Hidden Unit Clustering For Low Resource
Speech Applications [37.89857769906568]
We describe an approach to self-supervised representation learning from raw audio using a hidden unit clustering (HUC) framework.
The input to the model consists of audio samples that are windowed and processed with 1-D convolutional layers.
The HUC framework categorizes the representations into a small number of phoneme-like units and is used to train the model to learn semantically rich speech representations.
arXiv Detail & Related papers (2023-07-14T13:02:10Z) - SLICER: Learning universal audio representations using low-resource
self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z) - Speaker Recognition in the Wild [0.0]
We propose a pipeline to find the number of speakers, as well as the audio belonging to each of these identified speakers.
We use this approach as a part of our Data Preparation pipeline for Speech Recognition in Indic languages.
arXiv Detail & Related papers (2022-05-05T07:17:17Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Self-supervised Learning with Random-projection Quantizer for Speech
Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels produced by a random-projection quantizer (a minimal sketch of such a quantizer follows after this list).
It achieves word error rates similar to previous work using self-supervised learning with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z) - Self-Supervised Learning for speech recognition with Intermediate layer
supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on the LibriSpeech test-other set show that our method significantly outperforms HuBERT.
arXiv Detail & Related papers (2021-12-16T10:45:05Z) - Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
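
As a companion to the 2022-02-03 entry above, here is a minimal sketch of a random-projection quantizer that turns continuous speech frames into the discrete labels used as masked-prediction targets. The feature dimension, codebook size, and cosine-similarity lookup are assumptions made for illustration, not that paper's exact configuration.

```python
# Sketch of a random-projection quantizer: continuous frames are projected by a
# frozen random matrix and mapped to the nearest entry of a frozen random
# codebook; the entry's index is the discrete label for masked prediction.
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    def __init__(self, feat_dim=80, code_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.projection = torch.randn(feat_dim, code_dim, generator=g)      # frozen
        codebook = torch.randn(codebook_size, code_dim, generator=g)        # frozen
        self.codebook = F.normalize(codebook, dim=-1)

    def __call__(self, feats):
        # feats: (batch, frames, feat_dim) -> labels: (batch, frames) long
        proj = F.normalize(feats @ self.projection, dim=-1)
        # nearest codebook entry by cosine similarity gives the target label
        return torch.argmax(proj @ self.codebook.T, dim=-1)

quantizer = RandomProjectionQuantizer()
frames = torch.randn(2, 200, 80)     # e.g. log-mel filterbank frames (assumed input)
labels = quantizer(frames)           # discrete labels to predict at masked positions
```

Keeping both the projection and the codebook random and frozen means no quantizer training is needed; the encoder alone is optimized to predict the resulting labels.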
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.