Emphasis control for parallel neural TTS
- URL: http://arxiv.org/abs/2110.03012v1
- Date: Wed, 6 Oct 2021 18:45:39 GMT
- Title: Emphasis control for parallel neural TTS
- Authors: Shreyas Seshadri, Tuomo Raitio, Dan Castellani, Jiangchuan Li
- Abstract summary: The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody.
Recent parallel neural text-to-speech (TTS) methods are able to generate speech with high fidelity while maintaining high performance.
This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis.
- Score: 8.039245267912511
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The semantic information conveyed by a speech signal is strongly influenced
by local variations in prosody. Recent parallel neural text-to-speech (TTS)
synthesis methods are able to generate speech with high fidelity while
maintaining high performance. However, these systems often lack simple control
over the output prosody, thus restricting the semantic information conveyable
for a given text. This paper proposes a hierarchical parallel neural TTS system
for prosodic emphasis control by learning a latent space that directly
corresponds to a change in emphasis. Three candidate features for the latent
space are compared: 1) Variance of pitch and duration within words in a
sentence, 2) a wavelet-based feature computed from pitch, energy, and duration,
and 3) a learned combination of the above features. Objective measures reveal
that the proposed methods are able to achieve a wide range of emphasis
modification, and subjective evaluations on the degree of emphasis and the
overall quality indicate that they show promise for real-world applications.
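To make the first candidate feature concrete, here is a minimal Python sketch of computing the variance of pitch and duration within each word of a sentence; the function name, array shapes, and toy values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def word_level_variance(pitch, duration, word_boundaries):
    """Candidate emphasis feature 1 (a sketch, not the paper's code):
    variance of pitch and duration within each word of a sentence.

    pitch, duration: 1-D arrays of per-phone values for one sentence.
    word_boundaries: list of (start, end) phone index pairs, one per word.
    """
    features = []
    for start, end in word_boundaries:
        f0_var = np.var(pitch[start:end])
        dur_var = np.var(duration[start:end])
        features.append((f0_var, dur_var))
    return np.array(features)  # shape: (num_words, 2)

# Toy usage: 6 phones grouped into 2 words.
pitch = np.array([120.0, 180.0, 150.0, 110.0, 112.0, 111.0])
duration = np.array([0.08, 0.15, 0.10, 0.07, 0.07, 0.06])
print(word_level_variance(pitch, duration, [(0, 3), (3, 6)]))
```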
Related papers
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
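As an illustration of the factorized vector quantization idea, here is a minimal PyTorch sketch with one codebook per attribute subspace; the class, subspace sizes, and codebook size are hypothetical and not NaturalSpeech 3's actual implementation.

```python
import torch
import torch.nn as nn

class FactorizedVQ(nn.Module):
    """Sketch of factorized vector quantization: each attribute
    (content, prosody, timbre, acoustic detail) gets its own codebook
    over its own subspace of the latent vector."""

    def __init__(self, dims=(64, 32, 32, 64), codebook_size=256):
        super().__init__()
        self.dims = list(dims)
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, d) for d in dims]
        )

    def forward(self, z):
        # z: (batch, sum(dims)); split into per-attribute subspaces.
        outs = []
        for sub, book in zip(torch.split(z, self.dims, dim=-1), self.codebooks):
            dists = torch.cdist(sub, book.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)             # nearest codebook entry
            q = book(idx)
            # Straight-through estimator so gradients reach the encoder.
            outs.append(sub + (q - sub).detach())
        return torch.cat(outs, dim=-1)

fvq = FactorizedVQ()
print(fvq(torch.randn(2, 192)).shape)  # torch.Size([2, 192])
```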
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- Transduce and Speak: Neural Transducer for Text-to-Speech with Semantic Token Prediction [14.661123738628772]
We introduce a text-to-speech(TTS) framework based on a neural transducer.
We use discretized semantic tokens acquired from wav2vec2.0 embeddings, which makes it easy to adopt a neural transducer for the TTS framework enjoying its monotonic alignment constraints.
arXiv Detail & Related papers (2023-11-06T06:13:39Z)
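The discretization step can be sketched as k-means over frame-level embeddings; random vectors stand in for real wav2vec2.0 features here, and the cluster count is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for wav2vec2.0 frame embeddings: (num_frames, hidden_dim).
# In the paper's setting these would come from a pretrained wav2vec2.0 model.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768)).astype(np.float32)

# Discretize: fit k-means on training embeddings, then map each frame
# to its nearest centroid id to obtain a semantic token sequence.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(embeddings)
semantic_tokens = kmeans.predict(embeddings)  # (num_frames,) ints in [0, 100)
print(semantic_tokens[:20])
```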
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed TTS architecture is designed for multiple code generation and monotonic alignment.
We show that the proposed architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
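A minimal sketch of such unsupervised prosodic clustering, discretizing phoneme-level F0 and duration into a small set of labels; the synthetic data and cluster counts are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical phoneme-level prosodic features pooled over a multispeaker
# dataset: per-phoneme mean F0 (Hz) and duration (seconds).
rng = np.random.default_rng(1)
f0 = rng.normal(150.0, 40.0, size=(5000, 1))
dur = rng.gamma(2.0, 0.04, size=(5000, 1))

# Unsupervised prosodic clustering: discretize each feature separately
# into a small set of intuitive labels (e.g. 5 F0 levels, 5 duration levels).
f0_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(f0)
dur_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(dur)
print(f0_labels[:10], dur_labels[:10])
```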
- Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
arXiv Detail & Related papers (2022-09-22T06:13:07Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
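The mask-and-predict loop can be sketched as follows; the function, the re-masking schedule, and the toy model are assumptions for illustration, not TranSpeech's exact procedure.

```python
import torch

def mask_predict(model, tokens, mask_id, iterations=4):
    """Sketch of non-autoregressive mask-predict decoding over unit
    choices: repeatedly re-mask the least confident positions and
    re-predict them (`model` maps a token sequence to per-position logits)."""
    length = tokens.size(0)
    for it in range(iterations):
        logits = model(tokens)                      # (length, vocab)
        probs, preds = logits.softmax(-1).max(-1)   # confidence + argmax
        tokens = preds
        # Re-mask a shrinking fraction of the least confident positions.
        n_mask = int(length * (1.0 - (it + 1) / iterations))
        if n_mask > 0:
            worst = probs.argsort()[:n_mask]
            tokens = tokens.clone()
            tokens[worst] = mask_id
    return tokens

# Toy usage with a random "model" over a 50-unit vocabulary plus a mask token.
vocab, mask_id, length = 50, 50, 12
toy_model = lambda t: torch.randn(length, vocab)
print(mask_predict(toy_model, torch.full((length,), mask_id), mask_id))
```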
- Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [7.531331499935223]
We train a non-autoregressive parallel neural TTS model hierarchically conditioned on coarse and fine-grained acoustic speech features.
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension.
arXiv Detail & Related papers (2021-10-06T17:58:42Z)
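A minimal sketch of such hierarchical conditioning, broadcasting utterance-wise (coarse) features over the phone sequence and concatenating them with fine-grained per-phone features; names and dimensions are hypothetical.

```python
import torch

def hierarchical_condition(phone_encodings, fine_feats, coarse_feats):
    """Sketch of hierarchical prosody conditioning: utterance-wise
    (coarse) features are broadcast over time and concatenated with
    phone-level (fine) features and the text encodings.

    phone_encodings: (T, d_text)   per-phone text encoder outputs
    fine_feats:      (T, d_fine)   e.g. per-phone pitch, energy, duration
    coarse_feats:    (d_coarse,)   e.g. utterance pitch, pitch range, tilt
    """
    T = phone_encodings.size(0)
    coarse = coarse_feats.unsqueeze(0).expand(T, -1)  # broadcast over phones
    return torch.cat([phone_encodings, fine_feats, coarse], dim=-1)

out = hierarchical_condition(torch.randn(7, 256), torch.randn(7, 3),
                             torch.randn(5))
print(out.shape)  # torch.Size([7, 264])
```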
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain latent linguistic embeddings with different properties.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Controllable neural text-to-speech synthesis using intuitive prosodic features [3.709803838880226]
We train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions.
Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
arXiv Detail & Related papers (2020-09-14T22:37:44Z)
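A minimal sketch of extracting the five sentence-wise conditioning features from per-frame tracks; the extraction details (e.g. how spectral tilt is measured) are simplified assumptions rather than the paper's method.

```python
import numpy as np

def utterance_prosody_features(f0, energy, phone_durations, spectral_tilt):
    """Sketch of the five sentence-wise conditioning features: pitch,
    pitch range, phone duration, energy, and spectral tilt. Inputs are
    per-frame (or per-phone) tracks; this is an illustrative summary,
    not the paper's exact feature extraction."""
    voiced = f0[f0 > 0]  # ignore unvoiced frames
    return {
        "pitch": float(np.mean(np.log(voiced))),
        "pitch_range": float(np.log(voiced.max()) - np.log(voiced.min())),
        "phone_duration": float(np.mean(phone_durations)),
        "energy": float(np.mean(energy)),
        "spectral_tilt": float(np.mean(spectral_tilt)),
    }

rng = np.random.default_rng(2)
print(utterance_prosody_features(
    f0=rng.uniform(80.0, 250.0, 200), energy=rng.uniform(0.1, 1.0, 200),
    phone_durations=rng.gamma(2.0, 0.04, 40),
    spectral_tilt=rng.normal(-6.0, 1.0, 200)))
```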
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.