On the Relevance of Phoneme Duration Variability of Synthesized Training
Data for Automatic Speech Recognition
- URL: http://arxiv.org/abs/2310.08132v1
- Date: Thu, 12 Oct 2023 08:45:21 GMT
- Title: On the Relevance of Phoneme Duration Variability of Synthesized Training
Data for Automatic Speech Recognition
- Authors: Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter
- Abstract summary: We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
- Score: 0.552480439325792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic data generated by text-to-speech (TTS) systems can be used to
improve automatic speech recognition (ASR) systems in low-resource or domain
mismatch tasks. It has been shown that TTS-generated outputs still do not have
the same qualities as real data. In this work we focus on the temporal
structure of synthetic data and its relation to ASR training. By using a novel
oracle setup we show how much the degradation of synthetic data quality is
influenced by duration modeling in non-autoregressive (NAR) TTS. To get
reference phoneme durations we use two common alignment methods, a hidden
Markov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist
temporal classification (CTC) aligner. Using a simple algorithm based on random
walks we shift phoneme duration distributions of the TTS system closer to real
durations, resulting in an improvement of an ASR system using synthetic data in
a semi-supervised setting.
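The abstract does not spell out the random-walk algorithm, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes that predicted phoneme durations are integer frame counts and perturbs them with a bounded random walk to add variability. The function name `random_walk_durations` and the parameters `max_offset`, `min_frames`, and `seed` are hypothetical.

```python
import random


def random_walk_durations(durations, max_offset=2, min_frames=1, seed=None):
    """Perturb per-phoneme frame counts with a bounded random walk.

    Hypothetical illustration only: the walk offset changes by -1, 0, or +1
    from one phoneme to the next and is clamped to [-max_offset, max_offset].
    """
    rng = random.Random(seed)
    offset = 0
    perturbed = []
    for frames in durations:
        # Take one random-walk step and keep the offset within bounds.
        offset += rng.choice((-1, 0, 1))
        offset = max(-max_offset, min(max_offset, offset))
        # Never let a phoneme drop below the minimum number of frames.
        perturbed.append(max(min_frames, frames + offset))
    return perturbed


if __name__ == "__main__":
    predicted = [5, 3, 8, 1, 4, 6]  # made-up frame counts for six phonemes
    print(random_walk_durations(predicted, max_offset=2, seed=0))
```

Because consecutive offsets are correlated, neighbouring phonemes tend to be stretched or compressed together, which loosely resembles local changes in speaking rate. Independent per-phoneme noise would not have that property.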
Related papers
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z)
- On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
We find that auto-regressive decoding performs better than non-autoregressive decoding for data generation, and we propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z)
- On the Effect of Purely Synthetic Training Data for Different Automatic Speech Recognition Architectures [19.823015917720284]
We evaluate the utility of synthetic data for training automatic speech recognition.
We reproduce the original training data, training ASR systems solely on synthetic data.
We show that the TTS models generalize well, even when training scores indicate overfitting.
arXiv Detail & Related papers (2024-07-25T12:44:45Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- Text-only domain adaptation for end-to-end ASR using integrated text-to-mel-spectrogram generator [17.44686265224974]
We propose an end-to-end Automatic Speech Recognition (ASR) system that can be trained on transcribed speech data, text-only data, or a mixture of both.
We demonstrate that the proposed training method significantly improves ASR accuracy compared to the system trained on transcribed speech only.
arXiv Detail & Related papers (2023-02-27T18:47:55Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in an offline and online setup.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS [0.0]
We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units).
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask.
The final experimental results show that the TTS model using only the CNN component can reduce the training time compared to the classic TTS models such as Tacotron.
arXiv Detail & Related papers (2022-10-24T14:18:43Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)