Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features
- URL: http://arxiv.org/abs/2104.04050v1
- Date: Thu, 8 Apr 2021 20:50:15 GMT
- Title: Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features
- Authors: Mahsa Elyasi, Gaurav Bharaj
- Abstract summary: We propose a strategy for conditioning Tacotron-2 on two fundamental prosodic features in English -- stress syllable and pitch accent.
We show that jointly conditioned features at pre-encoder and intra-decoder stages result in prosodically natural synthesized speech.
- Score: 1.6286844497313562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural sequence-to-sequence text-to-speech synthesis (TTS), such as
Tacotron-2, transforms text into high-quality speech. However, generating
speech with natural prosody remains a challenge. Yasuda et al. show that,
unlike natural speech, Tacotron-2's encoder doesn't fully represent prosodic
features (e.g. syllable stress in English) from characters, resulting in flat
fundamental frequency variations.
In this work, we propose a novel, carefully designed strategy for conditioning
Tacotron-2 on two fundamental prosodic features in English -- stress syllable
and pitch accent -- that help achieve more natural prosody. To this end, we use
a classifier to learn these features in an end-to-end fashion, and apply
feature conditioning at three stages of Tacotron-2's text-to-mel-spectrogram
network: pre-encoder, post-encoder, and intra-decoder. Further, we show that
features jointly conditioned at the pre-encoder and intra-decoder stages result
in prosodically natural synthesized speech (vs. Tacotron-2) and allow the model
to produce speech with more accurate pitch accent and stress patterns.
Quantitative evaluations show that our formulation achieves higher
fundamental frequency contour correlation and lower Mel Cepstral Distortion
between synthesized and natural speech. Subjective evaluation shows that the
proposed method's Mean Opinion Score of 4.14 fares higher than baseline
Tacotron-2's 3.91, when compared against natural speech (LJSpeech corpus) at
4.28.
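To make the conditioning concrete, here is a minimal sketch of the pre-encoder variant: per-character stress and pitch-accent labels are embedded and fused with the character embeddings before Tacotron-2's encoder. This is an illustration, not the authors' released code; the label inventories, class names, and dimensions are assumptions.

```python
# Sketch of pre-encoder prosodic conditioning (illustrative, not the paper's code).
import torch
import torch.nn as nn

class ProsodyConditionedEmbedding(nn.Module):
    def __init__(self, n_chars=148, n_stress=3, n_accent=3,
                 char_dim=512, feat_dim=32):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.stress_emb = nn.Embedding(n_stress, feat_dim)  # e.g. none/primary/secondary
        self.accent_emb = nn.Embedding(n_accent, feat_dim)  # e.g. none/low/high
        # Project back to the encoder's expected input size.
        self.proj = nn.Linear(char_dim + 2 * feat_dim, char_dim)

    def forward(self, chars, stress, accent):
        # chars, stress, accent: (batch, seq_len) integer label tensors
        x = torch.cat([self.char_emb(chars),
                       self.stress_emb(stress),
                       self.accent_emb(accent)], dim=-1)
        return self.proj(x)  # (batch, seq_len, char_dim), fed to the encoder

emb = ProsodyConditionedEmbedding()
chars = torch.randint(0, 148, (2, 10))
stress = torch.randint(0, 3, (2, 10))
accent = torch.randint(0, 3, (2, 10))
print(emb(chars, stress, accent).shape)  # torch.Size([2, 10, 512])
```

Post-encoder and intra-decoder conditioning would, by analogy, inject the same feature embeddings after the encoder outputs or into the decoder inputs, respectively.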
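The two objective measures cited above are standard. A rough sketch, assuming the synthesized and reference frame sequences are already time-aligned and equal-length (real evaluation pipelines typically DTW-align first):

```python
# Sketch of F0 contour correlation and Mel Cepstral Distortion (MCD).
import numpy as np

def f0_correlation(f0_syn, f0_ref):
    """Pearson correlation between two F0 contours over voiced frames."""
    voiced = (f0_syn > 0) & (f0_ref > 0)
    return np.corrcoef(f0_syn[voiced], f0_ref[voiced])[0, 1]

def mel_cepstral_distortion(mc_syn, mc_ref):
    """Mean MCD in dB; rows are frames, columns are mel-cepstral
    coefficients (the 0th, energy, coefficient is excluded)."""
    diff = mc_syn[:, 1:] - mc_ref[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return np.mean(const * np.sqrt(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
f0 = rng.uniform(80, 300, 100)
mc = rng.normal(size=(100, 25))
print(f0_correlation(f0, f0 + rng.normal(0, 5, 100)))
print(mel_cepstral_distortion(mc, mc + rng.normal(0, 0.1, size=mc.shape)))
```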
Related papers
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
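As a rough illustration of the factorized-VQ idea (a generic sketch, not NaturalSpeech 3's actual codec; subspace dimensions and codebook sizes are assumptions), each attribute subspace of the latent is quantized against its own codebook:

```python
# Generic factorized vector quantization sketch.
import torch

def quantize(z, codebook):
    # z: (batch, dim); codebook: (codes, dim). Nearest-neighbour lookup.
    idx = torch.cdist(z, codebook).argmin(dim=-1)
    return codebook[idx], idx

dim = 64
codebooks = {name: torch.randn(256, dim)
             for name in ["content", "prosody", "timbre", "details"]}
z = torch.randn(8, 4 * dim)            # one latent vector per item
chunks = z.split(dim, dim=-1)          # factorize into four subspaces
quantized = [quantize(c, codebooks[n])[0]
             for c, n in zip(chunks, codebooks)]
z_q = torch.cat(quantized, dim=-1)     # recombined quantized latent
print(z_q.shape)                       # torch.Size([8, 256])
```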
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
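Residual vector quantization itself is easy to sketch. The following generic example (stage count and codebook sizes are assumptions, not NaturalSpeech 2's configuration) shows how each stage quantizes the residual left by the previous one:

```python
# Generic residual vector quantization (RVQ) sketch.
import torch

def rvq(z, codebooks):
    residual, z_q, codes = z, torch.zeros_like(z), []
    for cb in codebooks:                    # cb: (codes, dim)
        idx = torch.cdist(residual, cb).argmin(dim=-1)
        q = cb[idx]
        z_q = z_q + q                       # running quantized estimate
        residual = residual - q             # what later stages must explain
        codes.append(idx)
    return z_q, torch.stack(codes, dim=-1)  # latent + discrete token ids

torch.manual_seed(0)
books = [torch.randn(1024, 128) for _ in range(8)]  # 8 quantizer stages
z = torch.randn(4, 128)
z_q, codes = rvq(z, books)
print(z_q.shape, codes.shape)  # torch.Size([4, 128]) torch.Size([4, 8])
```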
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling speedups of up to 21.4x over the autoregressive technique.
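The mask-and-predict loop can be sketched generically; the model call, vocabulary size, and masking schedule below are placeholders, not TranSpeech's implementation:

```python
# Generic iterative mask-predict decoding sketch for a non-autoregressive
# unit decoder.
import torch

MASK = 0

def mask_predict(model, src, tgt_len, n_iters=4):
    units = torch.full((tgt_len,), MASK)       # start fully masked
    for t in range(n_iters):
        logits = model(src, units)             # (tgt_len, vocab)
        probs, pred = logits.softmax(-1).max(-1)
        units = pred
        n_mask = int(tgt_len * (1 - (t + 1) / n_iters))
        if n_mask > 0:                         # re-mask least confident units
            worst = probs.argsort()[:n_mask]
            units[worst] = MASK
    return units

fake_model = lambda src, units: torch.randn(units.shape[0], 1000)
print(mask_predict(fake_model, src=None, tgt_len=12))
```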
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into the latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on the high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
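A rough sketch of the word-level quantization idea (band size, pooling, and codebook size are illustrative assumptions, not ProsoSpeech's exact design):

```python
# Sketch: pool the low-frequency band of the spectrogram over each word's
# frames, then quantize against a prosody codebook.
import torch

def word_lpv(mel, word_spans, codebook, n_low=20):
    # mel: (frames, n_mels); word_spans: list of (start, end) frame indices
    low = mel[:, :n_low]                      # keep low-frequency band
    pooled = torch.stack([low[s:e].mean(dim=0) for s, e in word_spans])
    idx = torch.cdist(pooled, codebook).argmin(dim=-1)
    return codebook[idx], idx                 # (words, n_low), code ids

mel = torch.randn(200, 80)
spans = [(0, 50), (50, 120), (120, 200)]
book = torch.randn(128, 20)
lpv, ids = word_lpv(mel, spans, book)
print(lpv.shape, ids)  # torch.Size([3, 20]) and three codebook indices
```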
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention [0.0]
We propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech.
Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention.
We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances.
arXiv Detail & Related papers (2022-01-25T15:06:07Z)
- Emphasis control for parallel neural TTS [8.039245267912511]
The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody.
Recent parallel neural text-to-speech (TTS) methods are able to generate speech with high fidelity while maintaining high performance.
This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis.
arXiv Detail & Related papers (2021-10-06T18:45:39Z)
- Using previous acoustic context to improve Text-to-Speech synthesis [30.885417054452905]
We leverage the sequential nature of the data using an acoustic context encoder that produces an embedding of the previous utterance audio.
We compare two secondary tasks: predicting the ordering of utterance pairs, and predicting the embedding of the current utterance audio.
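A minimal sketch of such an acoustic context encoder (the architecture below is an assumption for illustration): the previous utterance's mel spectrogram is summarized into a single conditioning embedding.

```python
# Sketch of an acoustic context encoder over the previous utterance.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=128, out_dim=64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)

    def forward(self, prev_mel):
        # prev_mel: (batch, frames, n_mels) from the preceding utterance
        _, h = self.rnn(prev_mel)             # h: (1, batch, hidden)
        return self.proj(h[-1])               # (batch, out_dim) embedding

enc = ContextEncoder()
print(enc(torch.randn(2, 300, 80)).shape)     # torch.Size([2, 64])
```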
arXiv Detail & Related papers (2020-12-07T15:00:18Z)
- Controllable neural text-to-speech synthesis using intuitive prosodic features [3.709803838880226]
We train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions.
Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
arXiv Detail & Related papers (2020-09-14T22:37:44Z) - Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based
TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
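The multi-task setup can be sketched as a weighted sum of the usual mel reconstruction loss and a phrase-break classification loss on the encoder states; the head shape and loss weight below are assumptions, not the paper's exact values:

```python
# Sketch of multi-task training: mel loss + phrase-break prediction loss.
import torch
import torch.nn as nn

mel_loss = nn.MSELoss()
break_loss = nn.CrossEntropyLoss()
break_head = nn.Linear(512, 2)          # break / no-break per token

enc_out = torch.randn(2, 10, 512)       # Tacotron encoder outputs
mel_pred, mel_ref = torch.randn(2, 100, 80), torch.randn(2, 100, 80)
break_ref = torch.randint(0, 2, (2, 10))

logits = break_head(enc_out)            # (2, 10, 2)
loss = mel_loss(mel_pred, mel_ref) + \
       0.1 * break_loss(logits.reshape(-1, 2), break_ref.reshape(-1))
loss.backward()
print(float(loss))
```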
arXiv Detail & Related papers (2020-08-11T07:57:29Z)