Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS
- URL: http://arxiv.org/abs/2211.01948v3
- Date: Sun, 16 Apr 2023 05:01:29 GMT
- Title: Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS
- Authors: Ziqi Liang
- Abstract summary: We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units).
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask.
The final experimental results show that a TTS model using only CNN components can reduce training time compared to classic TTS models such as Tacotron.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent Neural Networks (RNNs) have become the standard modeling technique
for sequence data, and are used in a number of novel text-to-speech models.
However, training a TTS model that includes RNN components places heavy demands on GPU resources and takes a long time. In contrast, studies have shown that
CNN-based sequence synthesis can greatly reduce training time in
text-to-speech models while maintaining comparable performance, thanks to its high
parallelism. We propose a new text-to-speech system based on deep convolutional
neural networks that does not employ any RNN components (recurrent units). At
the same time, we improve the generality and robustness of our model through a
series of data augmentation methods such as Time Warping, Frequency Mask, and
Time Mask. The final experimental results show that a TTS model using only
CNN components can reduce training time compared to classic TTS
models such as Tacotron while preserving the quality of the synthesized speech.
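As a concrete illustration of the augmentation pipeline, below is a minimal Python sketch of SpecAugment-style Frequency Mask and Time Mask applied to a mel spectrogram. The mask widths, array layout, and NumPy implementation are illustrative assumptions rather than the paper's exact settings, and Time Warping is omitted for brevity.

```python
import numpy as np

def frequency_mask(mel, max_width=8, rng=None):
    """Zero out a randomly chosen band of mel channels (Frequency Mask)."""
    rng = rng or np.random.default_rng()
    n_mels = mel.shape[0]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_mels - width)))
    out = mel.copy()
    out[start:start + width, :] = 0.0
    return out

def time_mask(mel, max_width=20, rng=None):
    """Zero out a randomly chosen span of frames (Time Mask)."""
    rng = rng or np.random.default_rng()
    n_frames = mel.shape[1]
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, max(1, n_frames - width)))
    out = mel.copy()
    out[:, start:start + width] = 0.0
    return out

# Example: augment an (80 mel bins x 200 frames) spectrogram.
mel = np.random.randn(80, 200).astype(np.float32)
augmented = time_mask(frequency_mask(mel))
```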
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Language Modeling Using Tensor Trains [11.19279979601076]
We propose a novel tensor network language model based on the simplest tensor network (i.e., tensor trains), called the 'Tensor Train Language Model' (TTLM).
TTLM represents sentences in an exponential space constructed by the tensor product of words, while computing the probabilities of sentences in a low-dimensional fashion.
arXiv Detail & Related papers (2024-05-07T18:09:47Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
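Since the EM-TTS abstract names only the two stages, the PyTorch sketch below is a hypothetical illustration of such a Text2Spectrum-plus-SSRN pipeline: a fully convolutional first stage predicts a coarse mel spectrogram from character IDs, and a second-stage super-resolution network upsamples it to a linear spectrogram. The layer sizes, kernel widths, and vocabulary size are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Text2Spectrum(nn.Module):
    """Stage 1 (hypothetical sketch): a fully convolutional network mapping
    character IDs to a coarse mel spectrogram. The text-to-frame alignment
    (attention) used in real TTS front ends is omitted for brevity."""
    def __init__(self, vocab_size=80, emb_dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(emb_dim, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=3, padding=1),
        )

    def forward(self, text_ids):                    # (batch, n_chars)
        x = self.embed(text_ids).transpose(1, 2)    # (batch, emb_dim, n_chars)
        return self.conv(x)                         # (batch, n_mels, frames)

class SSRN(nn.Module):
    """Stage 2 (hypothetical sketch): a spectrogram super-resolution network
    that upsamples the coarse mel output to a full-resolution linear
    spectrogram."""
    def __init__(self, n_mels=80, n_fft_bins=513):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(256, n_fft_bins, kernel_size=3, padding=1),
        )

    def forward(self, mel):
        return self.net(mel)

# Example flow: character IDs -> coarse mel -> linear spectrogram.
text = torch.randint(0, 80, (1, 32))
linear = SSRN()(Text2Spectrum()(text))
```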
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive TTS.
Using a simple algorithm, we shift the phoneme duration distributions of the TTS system closer to the real durations.
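The entry does not spell out the algorithm; one simple possibility, shown here purely as an assumption-laden sketch, is to standardise the synthetic durations and re-scale them toward the mean and standard deviation of real durations.

```python
import numpy as np

def shift_durations(synthetic, real_mean, real_std, alpha=1.0):
    """Hypothetical sketch: standardise synthetic phoneme durations and
    re-scale them to the mean/std of real durations; alpha interpolates
    between the original (0.0) and fully matched (1.0) distributions.
    The cited paper's actual algorithm may differ."""
    syn_mean, syn_std = synthetic.mean(), synthetic.std()
    matched = (synthetic - syn_mean) / (syn_std + 1e-8) * real_std + real_mean
    shifted = (1.0 - alpha) * synthetic + alpha * matched
    return np.maximum(shifted, 1.0)  # keep every phoneme at least one frame long

# Example: per-phoneme durations (in frames) for one synthetic utterance.
syn = np.array([3.0, 5.0, 4.0, 12.0, 2.0])
print(shift_durations(syn, real_mean=6.0, real_std=3.0, alpha=0.5))
```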
arXiv Detail & Related papers (2023-10-12T08:45:21Z)
- OverFlow: Putting flows on top of neural transducers for better TTS [9.346907121576258]
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.
In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics.
arXiv Detail & Related papers (2022-11-13T12:53:05Z)
- Model Blending for Text Classification [0.15229257192293197]
We try to reduce the complexity of state-of-the-art LSTM models for natural language tasks such as text classification by distilling their knowledge into CNN-based models, thus reducing the inference time (or latency) during testing.
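As a hedged illustration of this kind of knowledge distillation (the standard temperature-softened objective, not necessarily the paper's exact recipe), a student CNN can be trained on a blend of hard-label cross-entropy and the KL divergence to the LSTM teacher's softened outputs:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with the KL divergence between
    temperature-softened teacher (LSTM) and student (CNN) distributions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

# Example with random logits for an 8-sample, 4-class batch.
student = torch.randn(8, 4, requires_grad=True)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```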
arXiv Detail & Related papers (2022-08-05T05:07:45Z)
- Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning.
To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining a good quality performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z)
- Tensor train decompositions on recurrent networks [60.334946204107446]
Matrix product state (MPS) tensor trains have more attractive features than matrix product operators (MPOs) in terms of storage reduction and computing time at inference.
We show that MPS tensor trains should be at the forefront of LSTM network compression, through a theoretical analysis and practical experiments on NLP tasks.
arXiv Detail & Related papers (2020-06-09T18:25:39Z)
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)