A Further Study of Unsupervised Pre-training for Transformer Based
Speech Recognition
- URL: http://arxiv.org/abs/2005.09862v2
- Date: Tue, 23 Jun 2020 03:57:48 GMT
- Title: A Further Study of Unsupervised Pre-training for Transformer Based
Speech Recognition
- Authors: Dongwei Jiang, Wubo Li, Ruixiong Zhang, Miao Cao, Ne Luo, Yang Han,
Wei Zou, Xiangang Li
- Abstract summary: Masked Predictive Coding (MPC) has achieved significant improvements on various speech recognition datasets with a BERT-like Masked Reconstruction loss and a Transformer backbone.
In this paper, we focus on three important aspects: the effect of the pre-training data's speaking style, the extension of MPC to streaming models, and how to better transfer learned knowledge from the pre-training stage to downstream tasks.
- Score: 19.415695923461342
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Building a good speech recognition system usually requires large amounts of
transcribed data, which is expensive to collect. To tackle this problem, many
unsupervised pre-training methods have been proposed. Among these methods,
Masked Predictive Coding (MPC) achieved significant improvements on various speech
recognition datasets with a BERT-like Masked Reconstruction loss and a Transformer
backbone. However, many aspects of MPC have not been fully investigated. In
this paper, we conduct a further study on MPC and focus on three important
aspects: the effect of the pre-training data's speaking style, its extension to
streaming models, and how to better transfer learned knowledge from the
pre-training stage to downstream tasks. Experiments revealed that pre-training
data with a matching speaking style is more useful for downstream recognition
tasks. A unified training objective combining APC and MPC provided an 8.46%
relative error reduction on a streaming model trained on HKUST. In addition,
the combination of target data adaptation and layer-wise discriminative
training helped the knowledge transfer of MPC, achieving a 3.99% relative error
reduction on AISHELL over a strong baseline.
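To make the abstract's objectives concrete, the following is a minimal PyTorch sketch of the
techniques it names: the BERT-like masked reconstruction loss of MPC, an APC-style
future-prediction loss, their unified combination, and layer-wise discriminative learning rates
for knowledge transfer. This is an illustrative approximation rather than the authors'
implementation; the masking ratio, prediction shift, loss weight, and learning-rate decay factor
are assumptions.

```python
import torch
import torch.nn as nn


def mpc_loss(encoder: nn.Module, feats: torch.Tensor,
             mask_ratio: float = 0.15) -> torch.Tensor:
    """BERT-like Masked Reconstruction loss (MPC), sketched.

    feats: (batch, time, dim) acoustic features such as log-mel filterbanks.
    Randomly chosen frames are zeroed out and the Transformer encoder is
    trained to reconstruct the original frames at those masked positions.
    """
    mask = torch.rand(feats.shape[:2], device=feats.device) < mask_ratio
    corrupted = feats.masked_fill(mask.unsqueeze(-1), 0.0)
    predicted = encoder(corrupted)                 # (batch, time, dim)
    return (predicted - feats).abs()[mask].mean()  # L1 on masked frames only


def apc_loss(encoder: nn.Module, feats: torch.Tensor,
             shift: int = 3) -> torch.Tensor:
    """APC-style future-prediction loss, sketched.

    A causal (streaming-friendly) encoder predicts the frame `shift` steps
    ahead of each position; no masking is involved.
    """
    predicted = encoder(feats[:, :-shift, :])      # encode past frames
    target = feats[:, shift:, :]                   # future frames to predict
    return (predicted - target).abs().mean()


def unified_pretraining_loss(encoder: nn.Module, feats: torch.Tensor,
                             apc_weight: float = 0.5) -> torch.Tensor:
    """Unified MPC + APC objective; the weighting is an assumption."""
    return mpc_loss(encoder, feats) + apc_weight * apc_loss(encoder, feats)


def layerwise_discriminative_param_groups(layers, base_lr: float = 1e-4,
                                          decay: float = 0.75):
    """Layer-wise discriminative training for fine-tuning: layers closer to
    the input keep smaller learning rates, preserving more of the pre-trained
    knowledge; the decay factor is an assumption."""
    num_layers = len(layers)
    return [{"params": layer.parameters(),
             "lr": base_lr * decay ** (num_layers - 1 - i)}
            for i, layer in enumerate(layers)]
```

The parameter groups can be passed directly to a torch.optim optimizer, e.g.
torch.optim.Adam(layerwise_discriminative_param_groups(encoder_layers)), when fine-tuning on the
downstream task. Target data adaptation, taken here to mean running the same unsupervised
objective on the downstream corpus before supervised fine-tuning, would reuse the losses above
unchanged.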
Related papers
- Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining.
This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning).
We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z)
- Optimizing Model Splitting and Device Task Assignment for Deceptive Signal Assisted Private Multi-hop Split Learning [58.620753467152376]
In our model, several edge devices jointly perform collaborative training, and some eavesdroppers aim to collect the model and data information from devices.
To prevent the eavesdroppers from collecting model and data information, a subset of devices can transmit deceptive signals.
We propose a soft actor-critic deep reinforcement learning framework with an intrinsic curiosity module and cross-attention.
arXiv Detail & Related papers (2025-07-09T22:53:23Z)
- Analyzing Mitigation Strategies for Catastrophic Forgetting in End-to-End Training of Spoken Language Models [79.90523648823522]
Multi-stage continual learning can lead to catastrophic forgetting.
This paper evaluates three mitigation strategies: model merging, discounting the LoRA scaling factor, and experience replay.
Results show that experience replay is the most effective, with further gains achieved by combining it with other methods.
arXiv Detail & Related papers (2025-05-23T05:50:14Z)
- Contrastive Augmentation: An Unsupervised Learning Approach for Keyword Spotting in Speech Technology [4.080686348274667]
We introduce a novel approach combining unsupervised contrastive learning and a unique augmentation-based technique.
Our method allows the neural network to train on unlabeled data sets, potentially improving performance in downstream tasks.
We present a speech augmentation-based unsupervised learning method that utilizes the similarity between the bottleneck layer feature and the audio reconstructing information.
arXiv Detail & Related papers (2024-08-31T05:40:37Z)
- What to Remember: Self-Adaptive Continual Learning for Audio Deepfake Detection [53.063161380423715]
Existing detection models have shown remarkable success in discriminating known deepfake audio, but struggle when encountering new attack types.
We propose a continual learning approach called Radian Weight Modification (RWM) for audio deepfake detection.
arXiv Detail & Related papers (2023-12-15T09:52:17Z)
- Selectivity Drives Productivity: Efficient Dataset Pruning for Enhanced Transfer Learning [66.20311762506702]
Dataset pruning (DP) has emerged as an effective way to improve data efficiency.
We propose two new DP methods, label mapping and feature mapping, for supervised and self-supervised pretraining settings.
We show that source data classes can be pruned by up to 40%-80% without sacrificing downstream performance.
arXiv Detail & Related papers (2023-10-13T00:07:49Z)
- Fast Machine Unlearning Without Retraining Through Selective Synaptic Dampening [51.34904967046097]
We present Selective Synaptic Dampening (SSD), a novel two-step, post hoc, retrain-free approach to machine unlearning that is fast, performant, and does not require long-term storage of the training data.
arXiv Detail & Related papers (2023-08-15T11:30:45Z)
- Guided contrastive self-supervised pre-training for automatic speech recognition [16.038298927903632]
Contrastive Predictive Coding (CPC) is a representation learning method that maximizes the mutual information between intermediate latent representations and the output of a given model.
We present a novel modification of CPC called Guided Contrastive Predictive Coding (GCPC).
Our proposed method maximizes the mutual information between representations from a prior-knowledge model and the output of the model being pre-trained, allowing prior knowledge injection during pre-training.
arXiv Detail & Related papers (2022-10-22T02:38:43Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial for improving the quality of speech separated from real noisy speech.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Streaming end-to-end speech recognition with jointly trained neural feature enhancement [20.86554979122057]
We present a streaming end-to-end speech recognition model based on Monotonic Chunkwise Attention (MoCha) jointly trained with enhancement layers.
We introduce two training strategies: Gradual Application of Enhanced Features (GAEF) and Gradual Reduction of Enhanced Loss (GREL).
arXiv Detail & Related papers (2021-05-04T02:25:41Z)
- Improving speech recognition models with small samples for air traffic control systems [9.322392779428505]
In this work, a novel training approach based on pretraining and transfer learning is proposed to address the issue of small training samples.
Three real ATC datasets are used to validate the proposed ASR model and training strategies.
The experimental results demonstrate that the ASR performance is significantly improved on all three datasets.
arXiv Detail & Related papers (2021-02-16T08:28:52Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We further apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting [66.45372974713189]
We propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks.
Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark.
We provide the open-source RecAdam optimizer, which integrates the proposed mechanisms into Adam, to facilitate adoption by the NLP community.
arXiv Detail & Related papers (2020-04-27T08:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.