Guided contrastive self-supervised pre-training for automatic speech
recognition
- URL: http://arxiv.org/abs/2210.12335v1
- Date: Sat, 22 Oct 2022 02:38:43 GMT
- Title: Guided contrastive self-supervised pre-training for automatic speech
recognition
- Authors: Aparna Khare, Minhua Wu, Saurabhchand Bhati, Jasha Droppo, Roland Maas
- Abstract summary: Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model.
We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC).
Our proposed method maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge injection during pre-training.
- Score: 16.038298927903632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Predictive Coding (CPC) is a representation learning method that
maximizes the mutual information between intermediate latent representations
and the output of a given model. It can be used to effectively initialize the
encoder of an Automatic Speech Recognition (ASR) model. We present a novel
modification of CPC called Guided Contrastive Predictive Coding (GCPC). Our
proposed method maximizes the mutual information between representations from a
prior-knowledge model and the output of the model being pre-trained, allowing
prior knowledge injection during pre-training. We validate our method on 3 ASR
tasks: German, French and English. Our method outperforms CPC pre-training on
all three datasets, reducing the Word Error Rate (WER) by 4.44%, 6.55% and
15.43% relative on the German, French and English (Librispeech) tasks
respectively, compared to training from scratch, while CPC pre-training only
brings 2.96%, 1.01% and 14.39% relative WER reduction respectively.
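To make the contrastive objective concrete, below is a minimal PyTorch sketch of an InfoNCE-style loss in the spirit of CPC/GCPC. The function name, tensor shapes, temperature, and in-batch negative sampling are illustrative assumptions, not the authors' implementation; the only paper-grounded element is that in the guided variant the target representations would come from a frozen prior-knowledge model rather than from the encoder being pre-trained.

```python
# Minimal sketch of a CPC-style InfoNCE objective (assumptions: PyTorch,
# hypothetical shapes and hyperparameters; not the paper's implementation).
import torch
import torch.nn.functional as F


def info_nce_loss(context: torch.Tensor, targets: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Pull each context vector toward its matching target representation and
    push it away from the other targets in the batch (in-batch negatives).

    context: (N, D) outputs of the model being pre-trained
    targets: (N, D) representations to predict; in plain CPC these are latent
             frames from the same encoder, while in GCPC they would come from
             a frozen prior-knowledge model (per the abstract).
    """
    context = F.normalize(context, dim=-1)
    targets = F.normalize(targets, dim=-1)
    logits = context @ targets.t() / temperature          # (N, N) similarity matrix
    labels = torch.arange(context.size(0), device=context.device)
    return F.cross_entropy(logits, labels)                # positives lie on the diagonal


# Toy usage: 8 frames with 256-dimensional representations.
ctx = torch.randn(8, 256, requires_grad=True)
tgt = torch.randn(8, 256)   # e.g. outputs of a frozen prior-knowledge model
loss = info_nce_loss(ctx, tgt)
loss.backward()
```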
Related papers
- Unveiling the Role of Pretraining in Direct Speech Translation [14.584351239812394]
We compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch.
We observe that, throughout training, the randomly initialized model struggles to incorporate information from the speech inputs into its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
arXiv Detail & Related papers (2024-09-26T16:46:46Z) - Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z) - Headless Language Models: Learning without Predicting with Contrastive
Weight Tying [0.11510009152620666]
Self-supervised pre-training of language models usually consists in predicting probability distributions over extensive token vocabularies.
We propose an innovative method that shifts away from probability prediction and instead focuses on reconstructing input embeddings in a contrastive fashion via Contrastive Weight Tying (CWT).
We observe a significant +1.6 GLUE score increase and a notable +2.7 LAMBADA accuracy improvement compared to classical LMs within similar compute budgets.
arXiv Detail & Related papers (2023-09-15T12:20:00Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Zero-Shot Automatic Pronunciation Assessment [19.971348810774046]
We propose a novel zero-shot APA method based on the pre-trained acoustic model, HuBERT.
Experimental results on speechocean762 demonstrate that the proposed method achieves comparable performance to supervised regression baselines.
arXiv Detail & Related papers (2023-05-31T05:17:17Z) - Supervision-Guided Codebooks for Masked Prediction in Speech
Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z) - A Method to Reveal Speaker Identity in Distributed ASR Training, and How
to Counter It [3.18475216176047]
We design the first method for revealing the identity of the speaker of a training utterance with access only to a gradient.
We show that it is possible to reveal the speaker's identity with 34% top-1 accuracy (51% top-5 accuracy) on the LibriSpeech dataset.
arXiv Detail & Related papers (2021-04-15T23:15:12Z) - You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z) - Incremental Learning for End-to-End Automatic Speech Recognition [41.297106772785206]
We propose an incremental learning method for end-to-end Automatic Speech Recognition (ASR).
We design a novel explainability-based knowledge distillation for ASR models, which is combined with a response-based knowledge distillation to maintain the original model's predictions and the "reason" for the predictions.
Results on a multi-stage sequential training task show that our method outperforms existing ones in mitigating forgetting.
arXiv Detail & Related papers (2020-05-11T08:18:08Z) - Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less
Forgetting [66.45372974713189]
We propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks.
Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark.
We provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into Adam, to facilitate adoption by the NLP community.
arXiv Detail & Related papers (2020-04-27T08:59:57Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from the computation on simpler instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.