On the Interplay Between Sparsity, Naturalness, Intelligibility, and
Prosody in Speech Synthesis
- URL: http://arxiv.org/abs/2110.01147v1
- Date: Mon, 4 Oct 2021 02:03:28 GMT
- Authors: Cheng-I Jeff Lai, Erica Cooper, Yang Zhang, Shiyu Chang, Kaizhi Qian,
Yi-Lun Liao, Yung-Sung Chuang, Alexander H. Liu, Junichi Yamagishi, David
Cox, James Glass
- Abstract summary: We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
- Score: 102.80458458550999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Are end-to-end text-to-speech (TTS) models over-parametrized? To what extent
can these models be pruned, and what happens to their synthesis capabilities?
This work serves as a starting point to explore pruning both spectrogram
prediction networks and vocoders. We thoroughly investigate the tradeoffs
between sparsity and its subsequent effects on synthetic speech. Additionally,
we explored several aspects of TTS pruning: amount of finetuning data versus
sparsity, TTS-Augmentation to utilize unspoken text, and combining knowledge
distillation and pruning. Our findings suggest that not only are end-to-end TTS
models highly prunable, but also, perhaps surprisingly, pruned TTS models can
produce synthetic speech with equal or higher naturalness and intelligibility,
with similar prosody. All of our experiments are conducted on publicly
available models, and findings in this work are backed by large-scale
subjective tests and objective measures. Code and 200 pruned models are made
available to facilitate future research on efficiency in TTS.
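The pruning the abstract refers to is magnitude-based weight pruning. As a loose illustration only (the paper prunes spectrogram prediction networks and vocoders per layer; the weight matrix and helper below are hypothetical), unstructured magnitude pruning can be sketched as:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    A minimal sketch of unstructured magnitude pruning; real TTS
    pruning schedules typically prune gradually and finetune after.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)  # number of weights to remove
    if k == 0:
        return weights.copy()
    # threshold = k-th smallest magnitude; everything at or below it is dropped
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

w = np.array([[0.5, -0.1, 0.05],
              [-0.8, 0.02, 0.3]])
pruned = magnitude_prune(w, 0.5)  # zeroes the 3 smallest-magnitude entries
```

The surviving weights are unchanged; in practice the pruned model is then finetuned (here, on TTS data) to recover synthesis quality.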
Related papers
- DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model.
Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude.
This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors [8.419383213705789]
We introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors.
We find that DiT with minimal modifications outperforms U-Net, variable-length modeling with a speech length predictor, and conditions like semantic alignment in speech latent representations are key to further enhancement.
arXiv Detail & Related papers (2024-06-17T11:25:57Z) - Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS).
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z) - Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis [35.16243386407448]
Bridge-TTS is a novel TTS system that substitutes the noisy Gaussian prior in established diffusion-based TTS methods with a clean and deterministic one.
Specifically, we leverage the latent representation obtained from text input as our prior, and build a fully tractable Schrodinger bridge between it and the ground-truth mel-spectrogram.
arXiv Detail & Related papers (2023-12-06T13:31:55Z) - ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real
Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2023-03-20T12:06:14Z) - EPIC TTS Models: Empirical Pruning Investigations Characterizing
Text-To-Speech Models [26.462819114575172]
This work compares sparsity paradigms in text-to-speech synthesis, and is the first to do so.
arXiv Detail & Related papers (2022-09-22T09:47:25Z) - BERT, can HE predict contrastive focus? Predicting and controlling
prominence in neural TTS using a language model [29.188684861193092]
We evaluate the accuracy of a BERT model, finetuned to predict quantized acoustic prominence features, on utterances containing contrastive focus.
We also evaluate the controllability of pronoun prominence in a TTS model conditioned on acoustic prominence features.
arXiv Detail & Related papers (2022-07-04T20:43:41Z) - ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in
Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
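The LPV above is a learned quantized code. As a loose illustration of the vector-quantization step (the codebook and feature values here are hypothetical, not ProsoSpeech's), each prosody feature vector is mapped to its nearest codeword:

```python
import numpy as np

def vq_nearest(codebook, x):
    """Return the index of the codeword nearest to x (L2 distance),
    the basic lookup behind quantized prosody vectors."""
    distances = np.linalg.norm(codebook - x, axis=1)
    return int(np.argmin(distances))

# toy 3-entry codebook of 2-d prosody features
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [-1.0, 0.5]])
idx = vq_nearest(codebook, np.array([0.9, 1.1]))  # nearest codeword index
```

At training time the codebook itself is learned; at inference, a predictor outputs the index (or code) from text alone.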
arXiv Detail & Related papers (2022-02-16T01:42:32Z) - A Survey on Neural Speech Synthesis [110.39292386792555]
Text to speech (TTS) is a hot research topic in speech, language, and machine learning communities.
We conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends.
We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc.
arXiv Detail & Related papers (2021-06-29T16:50:51Z) - Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.