Review of end-to-end speech synthesis technology based on deep learning
- URL: http://arxiv.org/abs/2104.09995v1
- Date: Tue, 20 Apr 2021 14:24:05 GMT
- Title: Review of end-to-end speech synthesis technology based on deep learning
- Authors: Zhaoxi Mu, Xinyu Yang, Yizhuo Dong
- Abstract summary: The current research focus is deep learning-based end-to-end speech synthesis technology.
It mainly consists of three modules: text front-end, acoustic model, and vocoder.
This paper summarizes the open-source speech corpus of English, Chinese and other languages that can be used for speech synthesis tasks.
- Score: 10.748200013505882
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As an indispensable part of modern human-computer interaction systems, speech
synthesis technology helps users obtain the output of intelligent machines more easily
and intuitively, and has therefore attracted increasing attention. Due to the
high complexity and low efficiency of traditional speech synthesis technology,
the current research focus is deep learning-based end-to-end speech synthesis,
which offers more powerful modeling ability and a simpler pipeline. It mainly
consists of three modules: a text front-end, an acoustic model, and a vocoder.
This paper reviews the research status of these three parts, and classifies and
compares various methods according to their emphasis. Moreover, this paper also
summarizes open-source speech corpora in English, Chinese, and other languages
that can be used for speech synthesis tasks, and introduces some commonly used
subjective and objective speech quality evaluation methods. Finally, some
attractive future research directions are pointed out.
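The three-module pipeline described in the abstract (text front-end → acoustic model → vocoder) can be sketched with placeholder components. Every class and mapping below is a hypothetical toy standing in for a neural module, not any system from the paper:

```python
# Minimal sketch of the three-module end-to-end TTS pipeline.
# All components here are illustrative placeholders, not real models.

class TextFrontend:
    """Normalizes text and maps characters to integer token IDs."""
    def __init__(self):
        self.vocab = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz ")}

    def __call__(self, text):
        text = text.lower().strip()
        return [self.vocab[c] for c in text if c in self.vocab]

class AcousticModel:
    """Stands in for a neural model mapping token IDs to acoustic frames
    (e.g. mel-spectrogram frames); here each token yields one dummy frame."""
    def __init__(self, n_mels=4):
        self.n_mels = n_mels

    def __call__(self, token_ids):
        return [[(t + m) / 10.0 for m in range(self.n_mels)] for t in token_ids]

class Vocoder:
    """Stands in for a neural vocoder mapping frames to waveform samples;
    here each frame contributes its mean as one sample."""
    def __call__(self, frames):
        return [sum(f) / len(f) for f in frames]

def synthesize(text):
    frontend, acoustic, vocoder = TextFrontend(), AcousticModel(), Vocoder()
    return vocoder(acoustic(frontend(text)))

wave = synthesize("hello")
print(len(wave))  # one sample per input character in this toy setup
```

In real systems, each stage is a trained neural network (e.g. the acoustic model predicts mel-spectrograms and the vocoder generates waveform audio), but the data flow between the three modules is the same.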
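The subjective evaluation the abstract mentions is typically reported as a Mean Opinion Score (MOS): the average of listener ratings on a 1–5 scale, usually with a 95% confidence interval. A minimal computation, with purely hypothetical ratings:

```python
import math

def mos(ratings):
    """Mean Opinion Score with a 95% confidence interval (normal approximation)."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = 1.96 * math.sqrt(var / n)  # 95% CI half-width
    return mean, ci

# Hypothetical listener ratings (1 = bad ... 5 = excellent)
scores = [4, 5, 4, 3, 4, 5, 4, 4]
mean, ci = mos(scores)
print(f"MOS = {mean:.2f} ± {ci:.2f}")
```

Objective metrics (e.g. mel-cepstral distortion) complement MOS by comparing synthesized audio against a reference signal without human listeners.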
Related papers
- Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis [3.8251125989631674]
We propose an end-to-end context-aware Text-to-Speech (TTS) synthesis system.
It derives the conveyed emotion from text input and synthesises audio that focuses on emotions and speaker features for natural and expressive speech.
Our system showcases competitive inference time performance when benchmarked against state-of-the-art TTS models.
arXiv Detail & Related papers (2024-10-24T23:18:02Z) - Text to speech synthesis [0.27195102129095]
Text-to-speech synthesis (TTS) is a technology that converts written text into spoken words.
This abstract explores the key aspects of TTS synthesis, encompassing its underlying technologies, applications, and implications for various sectors.
arXiv Detail & Related papers (2024-01-25T02:13:45Z) - All-for-One and One-For-All: Deep learning-based feature fusion for
Synthetic Speech Detection [18.429817510387473]
Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever.
In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them.
The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities.
arXiv Detail & Related papers (2023-07-28T13:50:25Z) - On decoder-only architecture for speech-to-text and large language model
integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z) - A Vector Quantized Approach for Text to Speech Synthesis on Real-World
Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment.
We show that this architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z) - An Overview of Affective Speech Synthesis and Conversion in the Deep
Learning Era [39.91844543424965]
Affect, or expressivity, has the capacity to turn speech into a medium capable of conveying intimate thoughts, feelings, and emotions.
Following recent advances in text-to-speech synthesis, a paradigm shift is well under way in the fields of affective speech synthesis and conversion.
Deep learning, the technology which underlies most of the recent advances in artificial intelligence, is spearheading these efforts.
arXiv Detail & Related papers (2022-10-06T13:55:59Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - Spoken Style Learning with Multi-modal Hierarchical Context Encoding for
Conversational Text-to-Speech Synthesis [59.27994987902646]
Research on learning spoken styles from historical conversations is still in its infancy.
Existing methods consider only the transcripts of the historical conversations, neglecting the spoken styles in the historical speech itself.
We propose a spoken style learning approach with multi-modal hierarchical context encoding.
arXiv Detail & Related papers (2021-06-11T08:33:52Z) - SpeechBrain: A General-Purpose Speech Toolkit [73.0404642815335]
SpeechBrain is an open-source and all-in-one speech toolkit.
It is designed to facilitate the research and development of neural speech processing technologies.
It achieves competitive or state-of-the-art performance in a wide range of speech benchmarks.
arXiv Detail & Related papers (2021-06-08T18:22:56Z) - Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
arXiv Detail & Related papers (2020-12-23T22:19:42Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.