A Survey on Neural Speech Synthesis
- URL: http://arxiv.org/abs/2106.15561v2
- Date: Wed, 30 Jun 2021 15:58:54 GMT
- Title: A Survey on Neural Speech Synthesis
- Authors: Xu Tan, Tao Qin, Frank Soong, Tie-Yan Liu
- Abstract summary: Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
- Score: 110.39292386792555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text to speech (TTS), or speech synthesis, which aims to synthesize
intelligible and natural speech given text, is a hot research topic in speech,
language, and machine learning communities and has broad applications in the
industry. As the development of deep learning and artificial intelligence,
neural network-based TTS has significantly improved the quality of synthesized
speech in recent years. In this paper, we conduct a comprehensive survey on
neural TTS, aiming to provide a good understanding of current research and
future trends. We focus on the key components in neural TTS, including text
analysis, acoustic models and vocoders, and several advanced topics, including
fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
We further summarize resources related to TTS (e.g., datasets, opensource
implementations) and discuss future research directions. This survey can serve
both academic researchers and industry practitioners working on TTS.
Related papers
- On the Problem of Text-To-Speech Model Selection for Synthetic Data Generation in Automatic Speech Recognition [31.58289343561422]
We compare five different TTS decoder architectures in the scope of synthetic data generation to show the impact on CTC-based speech recognition training.
For data generation auto-regressive decoding performs better than non-autoregressive decoding, and propose an approach to quantify TTS generalization capabilities.
arXiv Detail & Related papers (2024-07-31T09:37:27Z) - MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - Text to speech synthesis [0.27195102129095]
Text-to-speech synthesis (TTS) is a technology that converts written text into spoken words.
This abstract explores the key aspects of TTS synthesis, encompassing its underlying technologies, applications, and implications for various sectors.
arXiv Detail & Related papers (2024-01-25T02:13:45Z) - A review-based study on different Text-to-Speech technologies [0.0]
The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS.
The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications.
arXiv Detail & Related papers (2023-12-17T20:07:23Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show thatRecent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised
Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z) - On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparstiy and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z) - Synth2Aug: Cross-domain speaker recognition with TTS synthesized speech [8.465993273653554]
We investigate the use of a multi-speaker Text-To-Speech system to synthesize speech in support of speaker recognition.
We observe on our datasets that TTS synthesized speech improves cross-domain speaker recognition performance.
We also explore the effectiveness of different types of text transcripts used for TTS synthesis.
arXiv Detail & Related papers (2020-11-24T00:48:54Z) - GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech
Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech synthesis (TTS) is one of such successful implementations.
We propose a novel neural TTS model, denoted as GraphSpeech, that is formulated under graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
arXiv Detail & Related papers (2020-10-23T14:14:06Z) - LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for languages with low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.