On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis
- URL: http://arxiv.org/abs/2110.01147v1
- Date: Mon, 4 Oct 2021 02:03:28 GMT
- Title: On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis
- Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian,
Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David
Cox, James Glass
- Abstract summary: We investigate the tradeoffs between sparsity and its effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
- Score: 102.80458458550999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent
can these models be pruned, and what happens to their synthesis capabilities?
This work serves as a starting point to explore pruning both spectrogram
prediction networks and vocoders. We thoroughly investigate the tradeoffs
between sparsity and its effects on synthetic speech. Additionally,
we explore several aspects of TTS pruning: the amount of finetuning data versus
sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge
distillation and pruning. Our findings suggest that not only are end-to-end TTS
models highly prunable, but also, perhaps surprisingly, pruned TTS models can
produce synthetic speech with equal or higher naturalness and intelligibility,
with similar prosody. All of our experiments are conducted on publicly
available models, and findings in this work are backed by large-scale
subjective tests and objective measures. Code and 200 pruned models are made
available to facilitate future research on efficiency in TTS.
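As a concrete illustration of the pruning recipe explored in the paper, here is a minimal sketch of unstructured magnitude pruning in PyTorch; the toy two-layer network and the 90% sparsity level are placeholder assumptions, not the paper's spectrogram predictors or vocoders:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy stand-in for a spectrogram predictor or vocoder (assumption).
model = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),
    nn.Linear(256, 80),
)

# Zero out the 90% smallest-magnitude weights in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.9)
        prune.remove(module, "weight")  # bake the zeros into the weight tensor

zeros = sum((p == 0).sum().item() for p in model.parameters())
total = sum(p.numel() for p in model.parameters())
print(f"overall sparsity: {zeros / total:.2%}")  # < 90%, since biases are kept
```

In practice such pruning is followed by finetuning on speech data, which is where the paper's data-versus-sparsity and distillation questions arise.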
Related papers
- Beyond Oversmoothing: Evaluating DDPM and MSE for Scalable Speech Synthesis in ASR [13.307889110301502]
We compare Denoising Diffusion Probabilistic Models (DDPM) to Mean Squared Error (MSE) based TTS models when their synthetic speech is used for ASR model training.
We find that for a given model size, DDPM can make better use of more data, and a more diverse set of speakers, than MSE models.
We achieve the best reported ratio between real and synthetic speech WER to date (1.46), but also find that a large gap remains.
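As a rough illustration of the two objectives being compared, the sketch below contrasts plain MSE regression against a DDPM noise-prediction loss; the toy linear networks, tensor shapes, and noise schedule are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

B, F, D, T = 4, 100, 80, 1000           # batch, frames, mel bins, diffusion steps
text = torch.randn(B, F, D)             # stand-in text encoding (assumed mel-shaped)
mel = torch.randn(B, F, D)              # target mel-spectrogram
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, T), dim=0)

mse_net = nn.Linear(D, D)               # toy regression decoder
denoiser = nn.Linear(2 * D, D)          # toy denoiser: sees noisy mel + text

# MSE objective: directly regress the mel-spectrogram.
loss_mse = ((mse_net(text) - mel) ** 2).mean()

# DDPM objective: predict the noise injected at a random timestep
# (timestep conditioning is omitted here for brevity).
t = torch.randint(0, T, (B,))
a = alpha_bar[t].view(B, 1, 1)
eps = torch.randn_like(mel)
noisy = a.sqrt() * mel + (1 - a).sqrt() * eps
loss_ddpm = ((denoiser(torch.cat([noisy, text], dim=-1)) - eps) ** 2).mean()
```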
arXiv Detail & Related papers (2024-10-16T06:35:56Z)
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms combined with predicting the total length of the speech representation.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
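A minimal sketch of the two mechanisms this summary names, cross-attention from speech latents to pretrained text-encoder outputs plus a total-length predictor; the module sizes and mean pooling are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

d = 256
text = torch.randn(2, 20, d)     # pretrained text-encoder outputs (B, L_text, d)
speech = torch.randn(2, 50, d)   # speech latents being modeled (B, L_speech, d)

# Text-speech alignment via cross-attention: speech queries attend to text.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
aligned, _ = cross_attn(query=speech, key=text, value=text)

# Total-length prediction: regress one frame count per utterance from the
# pooled text representation, instead of per-phoneme durations.
length_head = nn.Linear(d, 1)
total_len = length_head(text.mean(dim=1)).squeeze(-1)  # (B,)
```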
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis [35.16243386407448]
Bridge-TTS is a novel TTS system that substitutes the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one.
Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram.
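A sketch of the bridge idea as summarized here: noisy states interpolate between a deterministic text-derived prior and the ground-truth mel-spectrogram, with Brownian-bridge noise that vanishes at both endpoints. The noise scale, shapes, and schedule are assumptions, not the paper's exact parameterization:

```python
import torch

def bridge_sample(x0, x1, t, sigma=1.0):
    """Sample x_t on a Brownian bridge from x0 (text-latent prior, t=0)
    to x1 (ground-truth mel-spectrogram, t=1)."""
    t = t.view(-1, 1, 1)
    mean = (1 - t) * x0 + t * x1        # deterministic interpolation
    std = sigma * (t * (1 - t)).sqrt()  # zero noise at both endpoints
    return mean + std * torch.randn_like(x0)

x0 = torch.randn(4, 100, 80)   # prior from the text encoder (assumed shape)
x1 = torch.randn(4, 100, 80)   # target mel-spectrogram
xt = bridge_sample(x0, x1, torch.rand(4))
```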
arXiv Detail & Related papers (2023-12-06T13:31:55Z)
- ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
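For orientation only, a generic contrastive (InfoNCE) objective of the kind the summary mentions; the paper's geometry-aware selection of positives and negatives is replaced here by pre-paired feature rows, which is an assumption:

```python
import torch
import torch.nn.functional as F

def info_nce(feats_a, feats_b, temperature=0.1):
    """feats_a[i] and feats_b[i] are corresponding (positive) views."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature    # (N, N) cosine-similarity matrix
    targets = torch.arange(a.shape[0])  # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 64), torch.randn(16, 64))
```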
arXiv Detail & Related papers (2023-03-20T12:06:14Z)
- EPIC TTS Models: Empirical Pruning Investigations Characterizing Text-To-Speech Models [26.462819114575172]
This is the first work to compare sparsity paradigms in text-to-speech synthesis.
arXiv Detail & Related papers (2022-09-22T09:47:25Z)
- BERT, can HE predict contrastive focus? Predicting and controlling prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
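A minimal sketch of the prediction task as a token-classification finetune; the checkpoint name, the four quantized prominence bins, and the dummy labels are assumptions for illustration:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=4  # e.g. 4 quantized prominence bins
)

batch = tokenizer(["SHE bought the car"], return_tensors="pt")
labels = torch.tensor([[0, 3, 0, 0, 1, 0]])  # one bin per wordpiece (dummy)
loss = model(**batch, labels=labels).loss
loss.backward()  # finetune from here with any optimizer
```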
arXiv Detail & Related papers (2022-07-04T20:43:41Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
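The quantization step behind the LPV can be sketched as a nearest-neighbor codebook lookup; the codebook size and feature dimension are illustrative assumptions:

```python
import torch

codebook = torch.randn(128, 32)         # 128 learned prosody codes of dim 32
prosody = torch.randn(10, 32)           # word-level prosody features (10 words)

dists = torch.cdist(prosody, codebook)  # (10, 128) pairwise distances
codes = dists.argmin(dim=-1)            # nearest code index per word
lpv = codebook[codes]                   # quantized latent prosody vectors
```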
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders, as well as several advanced topics such as fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS.
arXiv Detail & Related papers (2021-06-29T16:50:51Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on word-level prosody features.
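A minimal sketch of that hierarchy, predicting phoneme-level prosody from phoneme states concatenated with the prosody features of their parent words; the layer sizes and the gather-based broadcast are assumptions:

```python
import torch
import torch.nn as nn

d_phone, d_word, d_pros = 64, 32, 3      # d_pros: e.g. F0, energy, duration
word_feats = torch.randn(2, 8, d_word)   # word-level prosody (B, n_words, d)
phone_h = torch.randn(2, 24, d_phone)    # phoneme encoder states
word_ids = torch.randint(0, 8, (2, 24))  # parent word of each phoneme

# Broadcast each word's prosody feature to its phonemes, then condition
# the phoneme-level prediction on the concatenation.
word_per_phone = torch.gather(
    word_feats, 1, word_ids.unsqueeze(-1).expand(-1, -1, d_word))
predictor = nn.Linear(d_phone + d_word, d_pros)
phone_prosody = predictor(torch.cat([phone_h, word_per_phone], dim=-1))
```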
arXiv Detail & Related papers (2020-11-12T16:16:41Z)