Comparing the Benefit of Synthetic Training Data for Various Automatic
Speech Recognition Architectures
- URL: http://arxiv.org/abs/2104.05379v1
- Date: Mon, 12 Apr 2021 11:59:23 GMT
- Title: Comparing the Benefit of Synthetic Training Data for Various Automatic
Speech Recognition Architectures
- Authors: Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf
Schlüter, Hermann Ney
- Abstract summary: We present a novel approach of silence correction in the data pre-processing for TTS systems.
We achieve a final word-error-rate of 3.3%/10.0% with a Hybrid system on the clean/noisy test-sets.
- Score: 44.803590841664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent publications on automatic-speech-recognition (ASR) have a strong focus
on attention encoder-decoder (AED) architectures which work well for large
datasets, but tend to overfit when applied in low resource scenarios. One
solution to tackle this issue is to generate synthetic data with a trained
text-to-speech system (TTS) if additional text is available. This was
successfully applied in many publications with AED systems. We present a novel
approach of silence correction in the data pre-processing for TTS systems which
increases the robustness when training on corpora targeted for ASR
applications. In this work we not only show the successful application of
synthetic data for AED systems, but also test the same method on a highly
optimized state-of-the-art Hybrid ASR system and a competitive monophone-based
system using connectionist temporal classification (CTC). We show that for the
latter systems the addition of synthetic data has only a minor effect, but they
still outperform the AED systems by a large margin on LibriSpeech-100h. We
achieve a final word-error-rate of 3.3%/10.0% with a Hybrid system on the
clean/noisy test-sets, surpassing any previous state-of-the-art systems that do
not include unlabeled audio data.
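The paper's exact silence-correction procedure is not reproduced in this abstract, but the general idea of energy-based silence trimming in ASR/TTS pre-processing can be sketched as follows. All parameter values (frame length, hop, threshold) are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def trim_silence(samples, sample_rate=16000, frame_ms=25,
                 hop_ms=10, threshold_db=-40.0):
    """Remove leading/trailing silence using frame-level energy.

    A minimal sketch: frames whose log-energy (relative to the
    loudest frame) falls below `threshold_db` are treated as silence,
    and the signal is cropped to the first/last voiced frame.
    """
    frame = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    # Frame-level energy, scanned with a fixed hop.
    energies = []
    for start in range(0, max(1, len(samples) - frame + 1), hop):
        energies.append(np.mean(samples[start:start + frame] ** 2))
    energies = np.asarray(energies)
    ref = energies.max() + 1e-12
    db = 10.0 * np.log10(energies / ref + 1e-12)
    voiced = np.where(db > threshold_db)[0]
    if len(voiced) == 0:
        return samples[:0]  # entirely silent
    start = voiced[0] * hop
    end = min(len(samples), voiced[-1] * hop + frame)
    return samples[start:end]
```

Trimming spurious leading/trailing silence matters for TTS training because the duration model otherwise learns to emit long pauses that are artifacts of the recording conditions rather than of the text.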
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-Audio Detection Framework and Benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures [19.823015917720284]
We evaluate the utility of synthetic data for training automatic speech recognition.
We reproduce the original training data, training ASR systems solely on synthetic data.
We show that the TTS models generalize well, even when training scores indicate overfitting.
arXiv Detail & Related papers (2024-07-25T12:44:45Z)
- On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
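The cited paper's algorithm is not given here; a hypothetical simplification of "shifting predicted durations closer to real durations" is moment matching, i.e. rescaling the TTS model's predicted phoneme durations to the mean and standard deviation measured on real data:

```python
import numpy as np

def match_duration_stats(pred_durations, real_mean, real_std):
    """Shift predicted phoneme durations toward real-duration statistics.

    Hypothetical sketch: standardize the predictions, then rescale to
    the target mean/std. `real_mean` and `real_std` would be measured
    from forced alignments of real speech.
    """
    pred = np.asarray(pred_durations, dtype=float)
    z = (pred - pred.mean()) / (pred.std() + 1e-12)
    shifted = z * real_std + real_mean
    # Durations are frame counts, so round and keep at least 1 frame.
    return np.maximum(1, np.round(shifted)).astype(int)
```

This preserves the relative ordering of the predicted durations while correcting the global scale and spread, which is the kind of mismatch non-autoregressive duration models tend to exhibit.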
arXiv Detail & Related papers (2023-10-12T08:45:21Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Wider or Deeper Neural Network Architecture for Acoustic Scene Classification with Mismatched Recording Devices [59.86658316440461]
We present a robust and low-complexity system for Acoustic Scene Classification (ASC).
We first construct an ASC baseline system in which a novel inception-residual-based network architecture is proposed to deal with the mismatched recording device issue.
To further improve the performance but still satisfy the low complexity model, we apply two techniques: ensemble of multiple spectrograms and channel reduction.
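The ensemble technique mentioned above can be sketched as averaging class probabilities from branches trained on different spectrogram types. The branch inputs (e.g. mel, gammatone, CQT spectrograms) and the plain averaging rule are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_predict(logits_per_branch):
    """Average class probabilities across spectrogram-specific branches.

    `logits_per_branch` is a list of (batch, num_classes) logit arrays,
    one per spectrogram type fed to the same backbone.
    """
    probs = [softmax(np.asarray(l, dtype=float)) for l in logits_per_branch]
    return np.mean(probs, axis=0)
```

Averaging in probability space rather than logit space keeps each branch's contribution bounded, which helps when one input representation is unreliable on a mismatched recording device.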
arXiv Detail & Related papers (2022-03-23T10:27:41Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve word-error-rate as well as to increase training speed.
We conduct experiments on Switchboard 300h dataset and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to obtain for ASR model training.
In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved the recognition performance on the new application by more than 65% relative.
arXiv Detail & Related papers (2021-06-14T23:26:44Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.