Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
- URL: http://arxiv.org/abs/2412.06602v1
- Date: Mon, 09 Dec 2024 15:50:25 GMT
- Title: Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey
- Authors: Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
- Abstract summary: Text-to-speech (TTS) is a prominent research area that aims to generate natural-sounding human speech from text.
With the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation.
In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts.
- Score: 8.476093391815766
- Abstract: Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. Besides, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.
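The fine-grained controls the survey describes (emotion, prosody, timbre, duration, plus free-form natural-language style prompts) can be pictured as a request interface. The following Python sketch is purely illustrative; every name in it is hypothetical and not drawn from any surveyed system.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical request object contrasting the two control styles the survey
# covers: explicit attribute values vs. a natural-language style prompt.
@dataclass
class ControllableTTSRequest:
    text: str
    emotion: Optional[str] = None       # e.g. "happy", "sad"
    pitch_scale: float = 1.0            # relative prosody control
    speed_scale: float = 1.0            # duration control
    speaker_id: Optional[str] = None    # timbre control
    style_prompt: Optional[str] = None  # e.g. "a calm elderly male voice"

    def control_mode(self) -> str:
        """Classify which control style this request uses."""
        if self.style_prompt is not None:
            return "prompt"
        if (self.emotion is not None or self.speaker_id is not None
                or self.pitch_scale != 1.0 or self.speed_scale != 1.0):
            return "attribute"
        return "none"
```

Attribute-based control is the classic approach; prompt-based control is the newer, LLM-era approach the survey emphasizes.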
Related papers
- Roadmap towards Superhuman Speech Understanding using Large Language Models [60.57947401837938]
Large language models (LLMs) are increasingly being extended to integrate speech and audio data.
Recent advances, such as GPT-4o, highlight the potential for end-to-end speech LLMs.
We propose a five-level roadmap, ranging from basic automatic speech recognition (ASR) to advanced superhuman models.
arXiv Detail & Related papers (2024-10-17T06:44:06Z) - A Survey of Text Style Transfer: Applications and Ethical Implications [4.749824105387292]
Text style transfer (TST) aims to control selected attributes of language use, such as politeness, formality, or sentiment, without altering the style-independent content of the text.
This paper presents a comprehensive review of TST applications that have been researched over the years, using both traditional linguistic approaches and more recent deep learning methods.
arXiv Detail & Related papers (2024-07-23T17:15:23Z) - Text to speech synthesis [0.27195102129095]
Text-to-speech synthesis (TTS) is a technology that converts written text into spoken words.
This abstract explores the key aspects of TTS synthesis, encompassing its underlying technologies, applications, and implications for various sectors.
arXiv Detail & Related papers (2024-01-25T02:13:45Z) - A review-based study on different Text-to-Speech technologies [0.0]
The paper examines the different TTS technologies available, including concatenative TTS, formant synthesis TTS, and statistical parametric TTS.
The study focuses on comparing the advantages and limitations of these technologies in terms of their naturalness of voice, the level of complexity of the system, and their suitability for different applications.
arXiv Detail & Related papers (2023-12-17T20:07:23Z) - Igniting Language Intelligence: The Hitchhiker's Guide From Chain-of-Thought Reasoning to Language Agents [80.5213198675411]
Large language models (LLMs) have dramatically enhanced the field of language intelligence.
LLMs leverage the intriguing chain-of-thought (CoT) reasoning techniques, obliging them to formulate intermediate steps en route to deriving an answer.
Recent research endeavors have extended CoT reasoning methodologies to nurture the development of autonomous language agents.
arXiv Detail & Related papers (2023-11-20T14:30:55Z) - TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z) - A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
arXiv Detail & Related papers (2021-06-29T16:50:51Z)
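The sparsity findings above (that end-to-end TTS models can be heavily pruned with little or no loss in naturalness and intelligibility) typically rest on techniques such as unstructured magnitude pruning. The following pure-Python sketch is a hypothetical illustration of that general technique, not code from the paper.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out roughly the smallest-magnitude fraction `sparsity` of the weights.

    Ties at the threshold may zero slightly more than the requested fraction.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

In practice this is applied per layer (often iteratively, with fine-tuning between rounds) rather than to a flat weight list as shown here.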
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.