SPIRAL: Self-supervised Perturbation-Invariant Representation Learning
for Speech Pre-Training
- URL: http://arxiv.org/abs/2201.10207v1
- Date: Tue, 25 Jan 2022 09:53:36 GMT
- Title: SPIRAL: Self-supervised Perturbation-Invariant Representation Learning
for Speech Pre-Training
- Authors: Wenyong Huang, Zhenhe Zhang, Yu Ting Yeung, Xin Jiang, Qun Liu
- Abstract summary: SPIRAL works by learning denoising representation of perturbed data in a teacher-student framework.
We address the problem of noise-robustness that is critical to real-world speech applications.
- Score: 25.80559992732508
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new approach for speech pre-training named SPIRAL which works
by learning denoising representation of perturbed data in a teacher-student
framework. Specifically, given a speech utterance, we first feed the utterance
to a teacher network to obtain corresponding representation. Then the same
utterance is perturbed and fed to a student network. The student network is
trained to output representation resembling that of the teacher. At the same
time, the teacher network is updated as moving average of student's weights
over training steps. In order to prevent representation collapse, we apply an
in-utterance contrastive loss as pre-training objective and impose position
randomization on the input to the teacher. SPIRAL achieves competitive or
better results compared to state-of-the-art speech pre-training method wav2vec
2.0, with significant reduction of training cost (80% for Base model, 65% for
Large model). Furthermore, we address the problem of noise-robustness that is
critical to real-world speech applications. We propose multi-condition
pre-training by perturbing the student's input with various types of additive
noise. We demonstrate that multi-condition pre-trained SPIRAL models are more
robust to noisy speech (9.0% - 13.3% relative word error rate reduction on real
noisy test data), compared to applying multi-condition training solely in the
fine-tuning stage. The code will be released after publication.
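To make the training scheme in the abstract concrete, below is a minimal PyTorch sketch of the teacher-student setup it describes: the teacher is an exponential moving average of the student, the clean utterance goes to the teacher while a perturbed copy goes to the student, and an in-utterance contrastive loss ties the two representations together. The module names, tensor shapes, and loss details are simplifying assumptions (position randomization on the teacher input is omitted); this is not the authors' released code.

```python
# Minimal sketch of SPIRAL-style pre-training (assumptions: PyTorch, a generic
# `student` encoder returning frame-level features of shape (B, T, D)).
import copy
import torch
import torch.nn.functional as F


class SpiralSketch(torch.nn.Module):
    def __init__(self, student: torch.nn.Module, ema_decay: float = 0.999):
        super().__init__()
        self.student = student
        self.teacher = copy.deepcopy(student)        # teacher starts as a copy of the student
        for p in self.teacher.parameters():
            p.requires_grad_(False)                  # teacher is never updated by gradients
        self.ema_decay = ema_decay

    @torch.no_grad()
    def update_teacher(self):
        # Teacher weights track a moving average of the student's weights over training steps.
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1.0 - self.ema_decay)

    def forward(self, utterance, perturbed_utterance):
        # Clean utterance -> teacher; perturbed utterance -> student.
        with torch.no_grad():
            target = self.teacher(utterance)          # (B, T, D)
        pred = self.student(perturbed_utterance)      # (B, T, D)
        return in_utterance_contrastive_loss(pred, target)


def in_utterance_contrastive_loss(pred, target, temperature=0.1):
    # For each frame, the teacher representation at the same position is the positive;
    # other frames of the same utterance serve as negatives (in-utterance contrast).
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(target, dim=-1)
    logits = torch.einsum("btd,bsd->bts", pred, target) / temperature  # (B, T, T)
    labels = torch.arange(logits.size(1), device=logits.device)
    labels = labels.unsqueeze(0).expand(logits.size(0), -1)            # positive = same frame index
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())
```

In a training loop one would call `update_teacher()` after each optimizer step on the student; for the multi-condition pre-training described above, `perturbed_utterance` would be the clean waveform mixed with sampled additive noise.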
Related papers
- Unveiling the Role of Pretraining in Direct Speech Translation [14.584351239812394]
We compare the training dynamics of a system using a pretrained encoder, the conventional approach, and one trained from scratch.
We observe that, throughout the training, the randomly initialized model struggles to incorporate information from the speech inputs for its predictions.
We propose a subtle change in the decoder cross-attention to integrate source information from earlier steps in training.
arXiv Detail & Related papers (2024-09-26T16:46:46Z)
- Improving noisy student training for low-resource languages in End-to-End ASR using CycleGAN and inter-domain losses [28.74405969209494]
Training a semi-supervised end-to-end speech recognition system using noisy student training has significantly improved performance.
This paper considers a more extreme case of semi-supervised end-to-end automatic speech recognition where there are limited paired speech-text, unlabeled speech, and abundant external text.
arXiv Detail & Related papers (2024-07-26T10:57:06Z)
- INTapt: Information-Theoretic Adversarial Prompt Tuning for Enhanced Non-Native Speech Recognition [43.228070238684786]
We propose Information Theoretic Adversarial Prompt Tuning (INTapt) to mitigate representational bias in automatic speech recognition systems.
INTapt is trained with two simultaneous objectives: (1) adversarial training to reduce accent feature dependence between the original input and the prompt-concatenated input, and (2) CTC loss minimization to improve ASR performance on the prompt-concatenated input.
Experimental results show that INTapt improves ASR performance for L2 English speech and increases feature similarity between L2 and L1 accents.
arXiv Detail & Related papers (2023-05-25T13:06:01Z)
- A Training and Inference Strategy Using Noisy and Enhanced Speech as Target for Speech Enhancement without Clean Speech [24.036987059698415]
We propose a training and inference strategy that additionally uses enhanced speech as a target.
Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing.
Experimental results show that our proposed method outperforms several baselines.
arXiv Detail & Related papers (2022-10-27T12:26:24Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL)
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate Word Error Rate (WER) reductions of 10% relative on the well-benchmarked, Librispeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z)
- TAVAT: Token-Aware Virtual Adversarial Training for Language Understanding [55.16953347580948]
Gradient-based adversarial training is widely used in improving the robustness of neural networks.
It cannot be easily adapted to natural language processing tasks since the embedding space is discrete.
We propose a Token-Aware Virtual Adversarial Training (TAVAT) method to craft fine-grained perturbations.
arXiv Detail & Related papers (2020-04-30T02:03:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.