Speech Synthesis as Augmentation for Low-Resource ASR
- URL: http://arxiv.org/abs/2012.13004v1
- Date: Wed, 23 Dec 2020 22:19:42 GMT
- Title: Speech Synthesis as Augmentation for Low-Resource ASR
- Authors: Deblin Bagchi, Shannon Wotherspoon, Zhuolin Jiang and Prasanna Muthukumar
- Abstract summary: Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
- Score: 7.2244067948447075
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech synthesis might hold the key to low-resource speech recognition. Data
augmentation techniques have become an essential part of modern speech
recognition training. Yet, they are simple, naive, and rarely reflect
real-world conditions. Meanwhile, speech synthesis techniques have been rapidly
getting closer to the goal of achieving human-like speech. In this paper, we
investigate the possibility of using synthesized speech as a form of data
augmentation to lower the resources necessary to build a speech recognizer. We
experiment with three different kinds of synthesizers: statistical parametric,
neural, and adversarial. Our findings are interesting and point to new research
directions for the future.
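To make the augmentation idea concrete, below is a minimal sketch of the pipeline the abstract describes. The `synthesize` and `train_asr` interfaces are hypothetical placeholders (the paper compares statistical parametric, neural, and adversarial synthesizers; none of its actual code is reproduced here):

```python
import random

# Hypothetical placeholder interfaces; swap in a real TTS model and a real
# ASR trainer. The names are assumptions for illustration, not the paper's API.
def synthesize(text: str, voice_id: int) -> list[float]:
    """Return a waveform for `text` in the given synthetic voice."""
    raise NotImplementedError

def train_asr(pairs: list[tuple[list[float], str]]):
    """Train a recognizer on (audio, transcript) pairs."""
    raise NotImplementedError

def augment_with_tts(real_pairs, extra_texts, n_voices=4, ratio=1.0):
    """Pad a small real corpus with synthetic (audio, text) pairs.

    `ratio` caps synthetic data at ratio * len(real_pairs) so the
    recognizer is not dominated by synthesizer artifacts.
    """
    budget = int(ratio * len(real_pairs))
    synthetic = [(synthesize(t, random.randrange(n_voices)), t)
                 for t in extra_texts[:budget]]
    return real_pairs + synthetic

# Usage: asr_model = train_asr(augment_with_tts(real_pairs, unpaired_texts))
```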
Related papers
- Evaluating Speech Synthesis by Training Recognizers on Synthetic Speech [34.8899247119748]
We propose an evaluation technique that trains an ASR model on synthetic speech and assesses its performance on real speech (a sketch of the protocol follows this entry).
Our proposed metric demonstrates a strong correlation with both MOS naturalness and MOS intelligibility when compared to SpeechLMScore and MOSNet.
arXiv Detail & Related papers (2023-10-01T15:52:48Z)
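A rough sketch of this train-on-synthetic, test-on-real protocol. The `tts_under_test` and `train_and_decode` hooks are hypothetical placeholders; only the WER computation (via the jiwer package) is a real library call:

```python
import jiwer  # pip install jiwer; computes word error rate

def tts_under_test(text: str):
    """Hypothetical hook: the synthesizer being evaluated."""
    raise NotImplementedError

def train_and_decode(train_pairs, test_audio) -> list[str]:
    """Hypothetical hook: train an ASR model, return test transcripts."""
    raise NotImplementedError

def synthetic_speech_score(train_texts, real_audio, real_texts) -> float:
    """Train ASR purely on synthetic speech, decode real speech, and
    report WER: lower WER suggests more natural, intelligible synthesis."""
    train_pairs = [(tts_under_test(t), t) for t in train_texts]
    hypotheses = train_and_decode(train_pairs, real_audio)
    return jiwer.wer(real_texts, hypotheses)
```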
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors (a toy sketch of residual quantization follows this entry).
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
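For intuition about the residual vector quantizers mentioned above, here is a toy numpy sketch of RVQ encoding. The random codebooks are purely illustrative; the codec in the paper is a trained neural model:

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> tuple:
    """Residual VQ: each stage quantizes what earlier stages missed.

    `codebooks` is a list of (K, D) arrays; `x` is a (D,) latent vector.
    Returns one code index per stage and the cumulative reconstruction.
    """
    residual = x.astype(float)
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        codes.append(idx)
        recon = recon + cb[idx]
        residual = residual - cb[idx]
    return codes, recon

# Toy usage with random codebooks (4 stages, 256 entries, 8-dim latents):
rng = np.random.default_rng(0)
books = [rng.normal(size=(256, 8)) for _ in range(4)]
codes, recon = rvq_encode(rng.normal(size=8), books)
```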
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Using Deep Learning Techniques and Inferential Speech Statistics for AI Synthesised Speech Recognition [0.0]
We propose a model that can help discriminate a synthesized speech from an actual human speech and also identify the source of such a synthesis.
The model outperforms the state-of-the-art approaches by classifying the AI synthesized audio from real human speech with an error rate of 1.9% and detecting the underlying architecture with an accuracy of 97%.
arXiv Detail & Related papers (2021-07-23T18:43:10Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods (see the bitrate sketch after this entry).
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
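As a sanity check on figures like 365 bits per second: the bitrate of a discrete code stream is codes per second times bits per code. The numbers below are assumed for illustration and are not the paper's exact configuration:

```python
import math

def stream_bps(frame_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of one discrete stream: frames/sec x log2(codebook size)."""
    return frame_rate_hz * math.log2(codebook_size)

# Assumed example: 50 Hz content units drawn from a 100-entry codebook.
print(stream_bps(50, 100))  # ~332 bps; low-rate prosody/speaker streams
                            # would add the remaining few dozen bps.
```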
- Generating coherent spontaneous speech and gesture from text [21.90157862281996]
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
arXiv Detail & Related papers (2021-01-14T16:02:21Z)
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also reveals a durable power component in human speech that is missing from synthesized speech; a minimal cepstral sketch follows this entry.
arXiv Detail & Related papers (2020-09-03T21:29:41Z)
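A minimal numpy sketch of the cepstral half of this idea (the bispectral statistics are omitted). The pulse train stands in for the periodic excitation of voiced human speech:

```python
import numpy as np

def real_cepstrum(frame: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    Periodic excitation in human speech yields a strong peak at the
    pitch period; statistics of such peaks are the kind of cue a
    detector can feed to a classifier."""
    spectrum = np.fft.fft(frame)
    return np.fft.ifft(np.log(np.abs(spectrum) + eps)).real

# Toy check: a 100 Hz pulse train at 16 kHz has period 160 samples,
# so the cepstral peak should land near quefrency 160.
sr, f0 = 16000, 100
frame = np.zeros(1024)
frame[:: sr // f0] = 1.0
ceps = real_cepstrum(frame)
print(50 + int(np.argmax(ceps[50:400])))  # ~160
```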
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.