DPP-TTS: Diversifying prosodic features of speech via determinantal
point processes
- URL: http://arxiv.org/abs/2310.14663v1
- Date: Mon, 23 Oct 2023 07:59:46 GMT
- Title: DPP-TTS: Diversifying prosodic features of speech via determinantal
point processes
- Authors: Seongho Joo, Hyukhun Koh, Kyomin Jung
- Abstract summary: We propose DPP-TTS: a text-to-speech model based on Determinantal Point Processes (DPPs) with a prosody diversifying module.
Our TTS model is capable of generating speech samples that simultaneously consider perceptual diversity in each sample and among multiple samples.
- Score: 16.461724709212863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid advancement in deep generative models, recent neural
Text-To-Speech (TTS) models have succeeded in synthesizing human-like speech.
There have been some efforts to generate speech with various prosody beyond
monotonous prosody patterns. However, previous works have several limitations.
First, typical TTS models depend on the scaled sampling temperature for
boosting the diversity of prosody. Speech samples generated at high sampling
temperatures often lack perceptual prosodic diversity, which can adversely
affect the naturalness of the speech. Second, the diversity among samples is
neglected since the sampling procedure often focuses on a single speech sample
rather than multiple ones. In this paper, we propose DPP-TTS: a text-to-speech
model based on Determinantal Point Processes (DPPs) with a prosody diversifying
module. Our TTS model is capable of generating speech samples that
simultaneously consider perceptual diversity in each sample and among multiple
samples. In a side-by-side comparison test, we demonstrate that DPP-TTS
generates speech samples with more diversified prosody than the baselines
while also taking the naturalness of speech into account.
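As a rough illustration of why a DPP favors diverse sets, the following is a minimal sketch and not the authors' implementation. It assumes hypothetical 2-D prosody descriptors for four candidate samples, equal quality scores, an RBF similarity, and greedy subset selection; none of these choices come from the paper. A DPP scores a subset by the determinant of its kernel submatrix, which shrinks when the selected items are similar to each other.

```python
import math

def rbf_similarity(a, b, bandwidth=1.0):
    """Gaussian similarity between two feature vectors (1.0 when identical)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def build_kernel(features, quality):
    """L[i][j] = q_i * S_ij * q_j, a common quality-diversity decomposition."""
    n = len(features)
    return [[quality[i] * rbf_similarity(features[i], features[j]) * quality[j]
             for j in range(n)] for i in range(n)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def greedy_dpp_select(kernel, k):
    """Greedily pick k items, each step maximizing det of the chosen submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, -1.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = i, d
        selected.append(best_item)
    return selected

# Hypothetical descriptors: three near-identical candidates and one distinct one.
candidates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
quality = [1.0, 1.0, 1.0, 1.0]  # assumed equal quality for every candidate
kernel = build_kernel(candidates, quality)
chosen = greedy_dpp_select(kernel, 2)
print(chosen)
```

With these toy inputs the selection pairs the first candidate with the distant fourth one rather than with a near-duplicate, since the determinant of two nearly identical items is close to zero.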
Related papers
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Prosody-controllable spontaneous TTS with neural HMMs [11.472325158964646]
We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets.
We add utterance-level prosody control to an existing neural HMM-based TTS system.
We evaluate the system's capability of synthesizing two types of creaky voice.
arXiv Detail & Related papers (2022-11-24T11:06:11Z)
- StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis [23.17929822987861]
StyleTTS is a style-based generative model for parallel TTS that can synthesize diverse speech with natural prosody from a reference speech utterance.
Our method significantly outperforms state-of-the-art models on both single and multi-speaker datasets.
arXiv Detail & Related papers (2022-05-30T21:34:40Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into the latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given a word sequence, and fine-tune it on the high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)