InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR
- URL: http://arxiv.org/abs/2204.00174v1
- Date: Fri, 1 Apr 2022 02:51:21 GMT
- Title: InterAug: Augmenting Noisy Intermediate Predictions for CTC-based ASR
- Authors: Yu Nakagome, Tatsuya Komatsu, Yusuke Fujita, Shuta Ichimura, Yusuke Kida
- Abstract summary: We propose InterAug: a novel training method for CTC-based ASR using augmented intermediate representations for conditioning.
The proposed method exploits the conditioning framework of self-conditioned CTC to train robust models by conditioning with "noisy" intermediate predictions.
In experiments using augmentations simulating deletion, insertion, and substitution errors, we confirmed that the trained model acquires robustness to each error type.
- Score: 17.967459632339374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes InterAug: a novel training method for CTC-based ASR using
augmented intermediate representations for conditioning. The proposed method
exploits the conditioning framework of self-conditioned CTC to train robust
models by conditioning with "noisy" intermediate predictions. During
training, intermediate predictions are perturbed into incorrect predictions and
fed into the next layer for conditioning. The subsequent layers are trained,
through the intermediate losses, to correct these incorrect predictions. By
repeating the augmentation and the correction, iterative refinement, which
generally requires a special decoder, can be realized with the audio encoder
alone. To produce noisy intermediate predictions, we also introduce two new
augmentations, intermediate feature-space augmentation and intermediate
token-space augmentation, designed to simulate typical errors. Combining the
proposed InterAug framework with these augmentations allows explicit training
of robust audio encoders. In experiments using augmentations simulating
deletion, insertion, and substitution errors, we confirmed that the trained
model acquires robustness to each error type, boosting the speech recognition
performance of the strong self-conditioned CTC baseline.
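As a rough illustration of the method described in this abstract, the sketch below conditions a self-conditioned CTC encoder on token-space-augmented intermediate predictions. It is reconstructed from the abstract alone: the module and function names (SelfConditionedCTCEncoder, augment_token_space), the conditioning layer indices, and the error ratios are assumptions, not the authors' implementation.

```python
# Illustrative sketch of InterAug-style conditioning on noisy intermediate
# predictions; names, layer indices, and ratios are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


def augment_token_space(probs, sub_ratio=0.05, del_ratio=0.05, ins_ratio=0.05, blank_id=0):
    """Perturb per-frame token posteriors to simulate substitution, deletion,
    and insertion errors before they are used for conditioning."""
    B, T, V = probs.shape
    probs = probs.clone()
    # Substitution: replace some frames with a random one-hot token.
    sub = torch.rand(B, T, device=probs.device) < sub_ratio
    rand_tok = torch.randint(V, (B, T), device=probs.device)
    probs[sub] = F.one_hot(rand_tok[sub], V).float()
    # Deletion: force some frames to the blank symbol.
    dele = torch.rand(B, T, device=probs.device) < del_ratio
    probs[dele] = F.one_hot(torch.full_like(rand_tok[dele], blank_id), V).float()
    # Insertion (rough): duplicate the previous frame's posterior at some positions.
    ins = torch.rand(B, T, device=probs.device) < ins_ratio
    probs[ins] = torch.roll(probs, shifts=1, dims=1)[ins]
    return probs


class SelfConditionedCTCEncoder(nn.Module):
    """Audio encoder with intermediate CTC heads; intermediate posteriors are
    (optionally augmented and) fed back to condition the subsequent layers."""

    def __init__(self, d_model=256, vocab_size=1000, n_layers=12, cond_layers=(3, 6, 9)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
             for _ in range(n_layers)])
        self.ctc_head = nn.Linear(d_model, vocab_size)   # shared CTC projection
        self.cond_proj = nn.Linear(vocab_size, d_model)  # posterior -> feature space
        self.cond_layers = set(cond_layers)

    def forward(self, x, augment=True):
        # x: (batch, frames, d_model) acoustic features
        inter_logits = []
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.cond_layers:
                logits = self.ctc_head(x)
                inter_logits.append(logits)               # for intermediate CTC losses
                probs = logits.softmax(dim=-1)
                if augment:
                    probs = augment_token_space(probs)    # "noisy" intermediate prediction
                x = x + self.cond_proj(probs)             # condition the following layers
        return self.ctc_head(x), inter_logits
```

The layers after each conditioning point are then trained, via CTC losses on the intermediate and final logits, to correct the injected deletion, insertion, and substitution errors; this is the iterative refinement the abstract says can be realized inside the audio encoder alone.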
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training-data scarcity and to output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Better Intermediates Improve CTC Inference [37.68950144012098]
The paper first formulates self-conditioned CTC as a probabilistic model with an intermediate prediction as a latent representation.
We then propose two new conditioning methods based on the new formulation.
Experiments with the LibriSpeech dataset show relative performance improvements of up to 3%/12% on the test-clean/other sets compared to the original self-conditioned CTC.
arXiv Detail & Related papers (2022-04-01T02:51:23Z)
- Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions [14.376418789524783]
We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer (a minimal sketch of this loss combination follows the list below).
Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed.
arXiv Detail & Related papers (2021-04-06T18:00:03Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, respectively.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model.
Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)
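The two intermediate-CTC entries above describe the auxiliary-loss setup that the InterAug abstract builds on: CTC losses attached to intermediate encoder layers in addition to the last-layer CTC loss. A minimal PyTorch sketch of that loss combination is given below; the function name interctc_loss and the 50/50 weighting w are assumptions for illustration, not values reported by these papers.

```python
# Minimal sketch (hypothetical names): the last-layer CTC loss combined with
# the mean of the intermediate CTC losses from the conditioning layers.
import torch.nn.functional as F


def interctc_loss(final_logits, inter_logits, targets, in_lens, tgt_lens, w=0.5):
    # final_logits: (B, T, V); inter_logits: list of (B, T, V);
    # targets / in_lens / tgt_lens follow the usual F.ctc_loss conventions.
    def ctc(logits):
        log_probs = logits.log_softmax(dim=-1).transpose(0, 1)  # (T, B, V)
        return F.ctc_loss(log_probs, targets, in_lens, tgt_lens,
                          blank=0, zero_infinity=True)

    inter = sum(ctc(l) for l in inter_logits) / max(len(inter_logits), 1)
    return (1 - w) * ctc(final_logits) + w * inter
```

With the encoder sketch given after the abstract, final_logits and inter_logits would come directly from its forward pass; the augmented conditioning only changes what the later layers see, not how the losses are computed.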