On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis
- URL: http://arxiv.org/abs/2110.01147v1
- Date: Mon, 4 Oct 2021 02:03:28 GMT
- Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian,
Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David
Cox, James Glass
- Abstract summary: We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
- Score: 102.80458458550999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent
can these models be pruned, and what happens to their synthesis capabilities?
This work serves as a starting point to explore pruning both spectrogram
prediction networks and vocoders. We thoroughly investigate the tradeoffs
between sparsity and its subsequent effects on synthetic speech. Additionally,
we explored several aspects of TTS pruning: amount of finetuning data versus
sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge
distillation and pruning. Our findings suggest that not only are end-to-end TTS
models highly prunable, but also, perhaps surprisingly, pruned TTS models can
produce synthetic speech with equal or higher naturalness and intelligibility,
with similar prosody. All of our experiments are conducted on publicly
available models, and findings in this work are backed by large-scale
subjective tests and objective measures. Code and 200 pruned models are made
available to facilitate future research on efficiency in TTS.
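The pruning the abstract refers to is magnitude-based weight pruning. As a loose illustration only (the paper prunes spectrogram prediction networks and vocoders per layer; the weight matrix and helper below are hypothetical), unstructured magnitude pruning can be sketched as:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    A minimal sketch of unstructured magnitude pruning; real TTS
    pruning schedules typically prune gradually and finetune after.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # threshold = k-th smallest magnitude; everything at or below it is dropped
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.5, -0.1, 0.05],
              [-0.8, 0.02, 0.3]])
pruned = magnitude_prune(w, 0.5)  # zeroes the 3 smallest-magnitude entries
```

The surviving weights are unchanged; in practice the pruned model is then finetuned (here, on TTS data) to recover synthesis quality.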
Related papers
- DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model.
Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude.
This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors [8.419383213705789]
We introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors.
We find that DiT with minimal modifications outperforms U-Net, variable-length modeling with a speech length predictor, and conditions like semantic alignment in speech latent representations are key to further enhancement.
arXiv Detail & Related papers (2024-06-17T11:25:57Z) - Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z) - Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis [35.16243386407448]
Bridge-TTS is a novel TTS system that substitutes the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one.
Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram.
arXiv Detail & Related papers (2023-12-06T13:31:55Z) - ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real
Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2023-03-20T12:06:14Z) - EPIC TTS Models: Empirical Pruning Investigations Characterizing
Text-To-Speech Models [26.462819114575172]
This work compares sparsity paradigms in text-to-speech synthesis, and is the first to do so.
arXiv Detail & Related papers (2022-09-22T09:47:25Z) - BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z) - ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in
Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
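The LPV above is a learned quantized code. As a loose illustration of the vector-quantization step (the codebook and feature values here are hypothetical, not ProsoSpeech's), each prosody feature vector is mapped to its nearest codeword:

```python
import numpy as np

def vq_nearest(codebook, x):
    """Return the index of the codeword nearest to x (L2 distance),
    the basic lookup behind quantized prosody vectors."""
    distances = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(distances))

# toy 3-entry codebook of 2-d prosody features
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 0.5]])
idx = vq_nearest(codebook, np.array([0.9, 1.1]))  # nearest codeword index
```

At training time the codebook itself is learned; at inference, a predictor outputs the index (or code) from text alone.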
arXiv Detail & Related papers (2022-02-16T01:42:32Z) - A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
arXiv Detail & Related papers (2021-06-29T16:50:51Z) - Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.