Simple and Effective Unsupervised Speech Synthesis
- URL: http://arxiv.org/abs/2204.02524v2
- Date: Thu, 7 Apr 2022 02:46:21 GMT
- Title: Simple and Effective Unsupervised Speech Synthesis
- Authors: Alexander H. Liu, Cheng-I Jeff Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, James Glass
- Abstract summary: We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus.
- Score: 97.56065543192699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the first unsupervised speech synthesis system based on a
simple, yet effective recipe. The framework leverages recent work in
unsupervised speech recognition as well as existing neural-based speech
synthesis. Using only unlabeled speech audio and unlabeled text as well as a
lexicon, our method enables speech synthesis without the need for a
human-labeled corpus. Experiments demonstrate the unsupervised system can
synthesize speech similar to a supervised counterpart in terms of naturalness
and intelligibility measured by human evaluation.
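The recipe the abstract describes can be read as a two-stage pipeline: pseudo-label the unlabeled audio with an unsupervised ASR model, then train a standard neural TTS system on the pseudo-labeled pairs. Below is a minimal sketch of that pipeline; all function names (train_unsupervised_asr, train_neural_tts, and so on) are hypothetical placeholders, not the authors' code.

```python
# Minimal sketch of the unsupervised TTS recipe described above.
# All helpers (train_unsupervised_asr, train_neural_tts, ...) are
# hypothetical placeholders, not the authors' actual implementation.

def build_unsupervised_tts(audio_corpus, text_corpus, lexicon):
    # Step 1: train an unsupervised ASR model from unpaired audio and
    # text, using the lexicon to map words to phonemes.
    asr = train_unsupervised_asr(audio_corpus, text_corpus, lexicon)

    # Step 2: pseudo-label the unlabeled audio with the ASR model,
    # producing (audio, pseudo-transcript) pairs.
    pseudo_pairs = [(wav, asr.transcribe(wav)) for wav in audio_corpus]

    # Step 3: train an off-the-shelf neural TTS system on the
    # pseudo-labeled pairs as if they were human transcriptions.
    return train_neural_tts(pseudo_pairs)
```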
Related papers
- Towards Unsupervised Speech Recognition Without Pronunciation Models [57.222729245842054]
Most languages lack sufficient paired speech and text data to effectively train automatic speech recognition systems.
We propose removing the reliance on a phoneme lexicon to develop unsupervised ASR systems.
We experimentally demonstrate that an unsupervised speech recognizer can emerge from joint speech-to-speech and text-to-text masked token-infilling (a minimal sketch of this objective follows this entry).
arXiv Detail & Related papers (2024-06-12T16:30:58Z)
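The summary above attributes the emergent recognizer to joint masked token-infilling over speech units and text. Here is a minimal sketch of such an infilling loss, assuming a generic PyTorch model that maps token IDs to per-position vocabulary logits; the names and masking rate are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_infilling_loss(model: nn.Module, tokens: torch.Tensor,
                          mask_id: int, mask_prob: float = 0.15) -> torch.Tensor:
    """tokens: (batch, seq_len) of discrete speech units or text tokens."""
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    inputs = tokens.clone()
    inputs[mask] = mask_id        # corrupt the sampled positions
    logits = model(inputs)        # (batch, seq_len, vocab_size)
    # Train the model to infill only the masked positions.
    return F.cross_entropy(logits[mask], tokens[mask])
```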
- Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions [29.03308434639149]
We propose a method to generate artificial speaker embeddings that cannot be linked to a real human.
The controllable embeddings can be fed to a speech synthesis system that was conditioned on embeddings of real humans during training (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-26T15:54:12Z)
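One way to realize "discovery of principal directions" is PCA over a pool of real speaker embeddings, followed by sampling new embeddings along those directions. The sketch below is an illustrative reading of that idea, not the paper's method; shapes and scaling are assumptions.

```python
import numpy as np

# Illustrative sketch: discover principal directions in a pool of real
# speaker embeddings, then sample artificial embeddings along them.

def artificial_embeddings(real_embeddings: np.ndarray, n_samples: int = 10):
    """real_embeddings: (n_speakers, dim) array of real speaker embeddings."""
    mean = real_embeddings.mean(axis=0)
    centered = real_embeddings - mean
    # Principal directions = right singular vectors of the centered matrix.
    _, singular_values, directions = np.linalg.svd(centered, full_matrices=False)
    std = singular_values / np.sqrt(len(real_embeddings) - 1)
    # Sample a coefficient per direction and map back to embedding space.
    coeffs = np.random.randn(n_samples, len(std)) * std
    return mean + coeffs @ directions
```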
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition [60.84668086976436]
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language.
This paper proposes an unsupervised TTS system by leveraging recent advances in unsupervised automatic speech recognition (ASR)
Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each.
arXiv Detail & Related papers (2022-03-29T17:57:53Z)
- Speech Resynthesis from Discrete Disentangled Self-Supervised Representations [49.48053138928408]
We propose using self-supervised discrete representations for the task of speech resynthesis.
We extract low-bitrate representations for speech content, prosodic information, and speaker identity.
Using the obtained representations, we reach a rate of 365 bits per second while providing better speech quality than the baseline methods (a bitrate sketch follows this entry).
arXiv Detail & Related papers (2021-04-01T09:20:33Z)
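For intuition on the 365 bits per second figure: a discrete stream's bitrate is its frame rate times log2 of its codebook size. The codebook sizes and frame rates below are illustrative assumptions, not the paper's exact configuration.

```python
import math

# Back-of-envelope bitrate for a discrete speech representation.
# Unit counts and frame rates are illustrative assumptions; the paper
# reports 365 bits per second for its full representation.

def stream_bitrate(codebook_size: int, frames_per_second: float) -> float:
    """Bits per second for one discrete stream at a given frame rate."""
    return frames_per_second * math.log2(codebook_size)

content = stream_bitrate(codebook_size=100, frames_per_second=50)  # ~332 bit/s
pitch = stream_bitrate(codebook_size=20, frames_per_second=6.25)   # ~27 bit/s
print(f"total ≈ {content + pitch:.0f} bit/s")
```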
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI-synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also reveals a durable power component in human speech that is missing from synthesized speech (a cepstrum sketch follows this entry).
arXiv Detail & Related papers (2020-09-03T21:29:41Z)
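For readers who want to probe the cepstral claim, the real cepstrum is simply the inverse FFT of the log magnitude spectrum. A minimal sketch, with simplified windowing assumptions rather than the paper's exact pipeline:

```python
import numpy as np

# Minimal real-cepstrum sketch for inspecting the power structure the
# summary above mentions; signal handling is a simplified assumption.

def real_cepstrum(signal: np.ndarray) -> np.ndarray:
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(signal)
    log_magnitude = np.log(np.abs(spectrum) + 1e-10)  # avoid log(0)
    return np.real(np.fft.ifft(log_magnitude))

# Usage: compare cepstra of human and synthesized utterance frames.
frame = np.random.randn(1024)  # placeholder for a windowed speech frame
cep = real_cepstrum(frame)
```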
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.