ImportantAug: a data augmentation agent for speech
- URL: http://arxiv.org/abs/2112.07156v1
- Date: Tue, 14 Dec 2021 04:37:04 GMT
- Title: ImportantAug: a data augmentation agent for speech
- Authors: Viet Anh Trinh (1), Hassan Salami Kavaki (1), Michael I Mandel (1 and 2) ((1) CUNY Graduate Center, (2) Brooklyn College)
- Abstract summary: We introduce ImportantAug, a technique to augment training data for speech classification and recognition models.
Importance is predicted for each utterance by a data augmentation agent trained to maximize the amount of noise it adds while minimizing its impact on recognition performance.
- Score: 10.453223310129408
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce ImportantAug, a technique to augment training data for speech
classification and recognition models by adding noise to unimportant regions of
the speech and not to important regions. Importance is predicted for each
utterance by a data augmentation agent that is trained to maximize the amount
of noise it adds while minimizing its impact on recognition performance. The
effectiveness of our method is illustrated on version two of the Google Speech
Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3%
relative error rate reduction compared to conventional noise augmentation which
applies noise to speech without regard to where it might be most effective. It
also provides a 25.4% error rate reduction compared to a baseline without data
augmentation. Additionally, the proposed ImportantAug outperforms the
conventional noise augmentation and the baseline on two test sets with
additional noise added.
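The masking idea is straightforward to sketch. Below is a minimal illustration that assumes the agent's per-utterance importance mask is already available; the additive mixing rule, the function name apply_importance_mask, and the noise_scale parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def apply_importance_mask(speech, noise, importance, noise_scale=1.0):
    """Add noise only to time-frequency regions marked as unimportant.

    speech, noise: (freq, time) spectrogram-like arrays of equal shape.
    importance:    (freq, time) mask in [0, 1]; in ImportantAug this mask
                   is predicted per utterance by a learned agent
                   (1 = important, keep clean; 0 = unimportant, corrupt).
    """
    assert speech.shape == noise.shape == importance.shape
    return speech + noise_scale * (1.0 - importance) * noise

# Toy usage with random stand-ins for real features and the agent output.
rng = np.random.default_rng(0)
speech = rng.normal(size=(80, 100))       # e.g. 80 mel bins, 100 frames
noise = rng.normal(size=(80, 100))
importance = rng.uniform(size=(80, 100))  # a trained agent would predict this
augmented = apply_importance_mask(speech, noise, importance)
```

Conventional noise augmentation corresponds to the degenerate case importance = 0 everywhere, which is exactly the baseline the paper reports a 23.3% relative error rate reduction against.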
Related papers
- SpeechBlender: Speech Augmentation Framework for Mispronunciation Data Generation [11.91301106502376]
SpeechBlender is a fine-grained data augmentation pipeline for generating mispronunciation errors.
Our proposed technique achieves state-of-the-art results on Speechocean762 for ASR-dependent mispronunciation detection models.
(arXiv, 2022-11-02)
- An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition [51.232523987916636]
Differential privacy (DP) is one data protection avenue for safeguarding user information used to train deep models by imposing noisy distortion on private data.
This work extends PATE learning to dynamic patterns, namely speech, and presents a first experimental study applying it to ASR to avoid acoustic data leakage.
(arXiv, 2022-10-11)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
(arXiv, 2022-01-14)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve performance comparable to the best reported supervised approach using only 16% of the labeled data.
(arXiv, 2021-10-28)
- Interactive Feature Fusion for End-to-End Noise-Robust Speech Recognition [25.84784710031567]
We propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition.
Experimental results show that the proposed method achieves an absolute word error rate (WER) reduction of 4.1% over the best baseline.
Further analysis indicates that the proposed IFF-Net can complement the information missing from the over-suppressed enhanced features.
(arXiv, 2021-10-11)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
(arXiv, 2021-10-11)
- MixSpeech: Data Augmentation for Low-resource Automatic Speech Recognition [54.84624870942339]
MixSpeech is a simple yet effective data augmentation method based on mixup for automatic speech recognition (ASR); a minimal sketch of the mixup step appears after this list.
We apply MixSpeech on two popular end-to-end speech recognition models including LAS (Listen, Attend and Spell) and Transformer.
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation.
(arXiv, 2021-02-25)
- Data augmentation using prosody and false starts to recognize non-native children's speech [12.911954427107977]
This paper describes AaltoASR's speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition.
The task is to recognize non-native speech from children of various age groups given a limited amount of speech.
(arXiv, 2020-08-29)
- Data Augmenting Contrastive Learning of Speech Representations in the Time Domain [92.50459322938528]
We introduce WavAugment, a time-domain data augmentation library.
We find that a combination of pitch modification, additive noise and reverberation substantially increases the performance of CPC (contrastive predictive coding).
We also show that time-domain data augmentation consistently improves downstream limited-supervision phoneme classification by 12-15% relative; a chain in this style is sketched after this list.
(arXiv, 2020-07-02)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition separately, the two modules are integrated into one framework and jointly optimised using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from context information in the time and frequency domains.
The results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines that do not use them in most of the acoustic conditions tested.
(arXiv, 2020-01-14)
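Two of the augmentation recipes above are simple enough to sketch in code. First, the mixup step behind MixSpeech: this is a rough illustration, not the paper's exact recipe; the function name, the Beta(0.5, 0.5) weight distribution, and the equal-length inputs are assumptions, and in MixSpeech-style training the recognition losses of the two label sequences are combined with the same weight.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixspeech_inputs(x1, x2, alpha=0.5):
    """Blend the acoustic features of two utterances with a Beta-sampled
    weight (mixup). Training would combine the recognition losses of the
    two label sequences with the same weight:
        loss = lam * loss1 + (1 - lam) * loss2
    x1, x2: (time, feat) feature matrices, padded to equal length here.
    """
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

# Toy usage with random stand-ins for log-mel features.
x1 = rng.normal(size=(100, 80))
x2 = rng.normal(size=(100, 80))
x_mix, lam = mixspeech_inputs(x1, x2)
```

Second, a time-domain augmentation chain in the spirit of the WavAugment entry, combining additive noise and reverberation (pitch modification is omitted for brevity). This is a self-contained sketch with synthetic stand-ins, not the WavAugment library's API: the SNR value, the exponentially decaying noise impulse response, and the function names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wav, snr_db=15.0):
    """Additive white noise scaled to a target signal-to-noise ratio."""
    noise = rng.normal(size=wav.shape)
    snr = 10.0 ** (snr_db / 10.0)
    scale = np.sqrt(np.mean(wav ** 2) / (snr * np.mean(noise ** 2)))
    return wav + scale * noise

def add_reverb(wav, sample_rate=16000, rt60=0.3):
    """Crude reverberation: convolve with an exponentially decaying noise
    impulse response (a stand-in for a measured room response)."""
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    ir = rng.normal(size=n) * np.exp(-6.9 * t / rt60)  # ~60 dB decay at rt60
    ir /= np.sqrt(np.sum(ir ** 2))
    return np.convolve(wav, ir)[: len(wav)]

wav = rng.normal(size=16000)            # 1 s of fake audio as a stand-in
augmented = add_noise(add_reverb(wav))
```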
This list is automatically generated from the titles and abstracts of the papers on this site.