Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training
- URL: http://arxiv.org/abs/2206.10125v1
- Date: Tue, 21 Jun 2022 06:08:30 GMT
- Title: Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training
- Authors: Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu,
Furu Wei
- Abstract summary: Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
- Score: 102.14558233502514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, masked prediction pre-training has seen remarkable progress in
self-supervised learning (SSL) for speech recognition. It usually requires a
codebook obtained in an unsupervised way, making it less accurate and difficult
to interpret. We propose two supervision-guided codebook generation approaches
to improve automatic speech recognition (ASR) performance and also the
pre-training efficiency, either through decoding with a hybrid ASR system to
generate phoneme-level alignments (named PBERT), or performing clustering on
the supervised speech features extracted from an end-to-end CTC model (named
CTC clustering). Both the hybrid and CTC models are trained on the same small
amount of labeled speech as used in fine-tuning. Experiments demonstrate
significant superiority of our methods to various SSL and self-training
baselines, with up to 17.0% relative WER reduction. Our pre-trained models also
show good transferability in a non-ASR speech task.
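As an illustration of the second approach, the sketch below shows how a supervision-guided codebook could be built by clustering frame-level features from a CTC model, and how the resulting IDs would serve as masked-prediction targets. This is a minimal sketch based only on the abstract, not the authors' released implementation: the feature batches, the number of clusters, and the HuBERT-style cross-entropy loss are assumptions, and the PBERT variant would instead take phoneme-level alignments decoded by a hybrid ASR system as targets.

```python
# Minimal sketch (assumptions noted below) of CTC-clustering codebook
# generation and its use as masked-prediction targets.
import torch
import torch.nn.functional as F
from sklearn.cluster import MiniBatchKMeans


def build_codebook(feature_batches, num_codes=500):
    # Fit k-means over frame-level features extracted from a CTC model that
    # was trained on the small labeled set (num_codes is an assumed value).
    km = MiniBatchKMeans(n_clusters=num_codes, batch_size=10_000)
    for feats in feature_batches:          # feats: (num_frames, feat_dim) ndarray
        km.partial_fit(feats)
    return km


def frame_targets(km, feats):
    # Discretize each frame into a codebook ID; these supervision-guided IDs
    # stand in for purely unsupervised HuBERT-style cluster targets.
    return torch.from_numpy(km.predict(feats)).long()   # (num_frames,)


def masked_prediction_loss(logits, targets, mask):
    # Cross-entropy over the codebook, computed only at masked frames.
    # logits: (num_frames, num_codes); targets: (num_frames,); mask: bool.
    return F.cross_entropy(logits[mask], targets[mask])
```

Consistent with the abstract, both the hybrid model used for PBERT alignments and the CTC model whose features are clustered here would be trained only on the same small labeled set later used for fine-tuning.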
Related papers
- Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Improved Consistency Training for Semi-Supervised Sequence-to-Sequence
ASR via Speech Chain Reconstruction and Self-Transcribing [21.049557187137776]
We propose an improved consistency training paradigm of semi-supervised S2S ASR.
We utilize speech chain reconstruction as the weak augmentation to generate high-quality pseudo labels.
Our improved paradigm achieves a 12.2% CER improvement in the single-speaker setting and 38.6% in the multi-speaker setting.
arXiv Detail & Related papers (2022-05-14T04:26:13Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in
End-to-End Speech-to-Intent Systems [31.18865184576272]
This work is a step towards aligning speech embeddings and BERT embeddings on a token-by-token basis in a much more efficient and fine-grained manner.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
arXiv Detail & Related papers (2022-04-11T15:24:25Z)
- Combining Unsupervised and Text Augmented Semi-Supervised Learning for
Low Resourced Autoregressive Speech Recognition [7.067186994804316]
We pretrain state-of-the-art Conformer models in an unsupervised manner.
Additional text data is incorporated through external language models.
Final performance is an additional 2% better absolute when using CTC-based decoding for semi-supervised training.
arXiv Detail & Related papers (2021-10-29T14:59:18Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model
Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
- Joint Masked CPC and CTC Training for ASR [29.41599824919278]
We demonstrate a single-stage training of ASR models that can utilize both unlabeled and labeled data.
We show that this joint training method directly optimizes performance for the downstream ASR task using unsupervised data.
arXiv Detail & Related papers (2020-10-30T20:28:20Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech
and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)