QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis
- URL: http://arxiv.org/abs/2303.07682v1
- Date: Tue, 14 Mar 2023 07:53:19 GMT
- Title: QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis
- Authors: Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
- Abstract summary: We propose QI-TTS which aims to better transfer and control intonation to further deliver the speaker's questioning intention.
We propose a multi-style extractor to extract style embedding from two different levels.
Experiments have validated the effectiveness of QI-TTS for improving intonation in emotional speech.
- Score: 29.962519978925236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent expressive text to speech (TTS) models focus on synthesizing emotional
speech, but some fine-grained styles such as intonation are neglected. In this
paper, we propose QI-TTS which aims to better transfer and control intonation
to further deliver the speaker's questioning intention while transferring
emotion from reference speech. We propose a multi-style extractor to extract
style embedding from two different levels. While the sentence level represents
emotion, the final syllable level represents intonation. For fine-grained
intonation control, we use relative attributes to represent intonation
intensity at the syllable level. Experiments have validated the effectiveness of
QI-TTS for improving intonation expressiveness in emotional speech synthesis.
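As a rough illustration of the idea above, the sketch below pools a reference encoder's frame features at two levels, the whole sentence for emotion and the final syllable for intonation, and scales the intonation embedding by a relative-attribute intensity score. It is a minimal sketch under assumed module names, dimensions, and pooling choices, not the authors' implementation.

```python
# Illustrative sketch of a two-level style extractor in the spirit of QI-TTS.
# All names, dimensions, and pooling choices are assumptions for exposition.
import torch
import torch.nn as nn


class MultiStyleExtractor(nn.Module):
    def __init__(self, n_mels: int = 80, style_dim: int = 128):
        super().__init__()
        # Shared frame-level reference encoder (stand-in for a real reference/GST encoder).
        self.frame_encoder = nn.GRU(n_mels, style_dim, batch_first=True)
        self.emotion_proj = nn.Linear(style_dim, style_dim)     # sentence level -> emotion
        self.intonation_proj = nn.Linear(style_dim, style_dim)  # final syllable -> intonation

    def forward(self, mel: torch.Tensor, final_syllable_mask: torch.Tensor,
                intensity: torch.Tensor):
        """
        mel:                 [B, T, n_mels] reference mel-spectrogram
        final_syllable_mask: [B, T] 1.0 on frames of the utterance-final syllable
        intensity:           [B] relative intonation intensity in [0, 1]
        """
        frames, _ = self.frame_encoder(mel)                      # [B, T, style_dim]
        # Sentence-level embedding: average over all frames -> emotion style.
        emotion = self.emotion_proj(frames.mean(dim=1))          # [B, style_dim]
        # Final-syllable embedding: average over the last syllable's frames -> intonation style.
        mask = final_syllable_mask.unsqueeze(-1)                 # [B, T, 1]
        syl = (frames * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        # Scale the intonation embedding by a relative-attribute intensity score.
        intonation = self.intonation_proj(syl) * intensity.unsqueeze(-1)
        return emotion, intonation


if __name__ == "__main__":
    extractor = MultiStyleExtractor()
    mel = torch.randn(2, 200, 80)
    mask = torch.zeros(2, 200)
    mask[:, -40:] = 1.0                       # pretend the last 40 frames are the final syllable
    emo, into = extractor(mel, mask, torch.tensor([0.3, 0.9]))
    print(emo.shape, into.shape)              # torch.Size([2, 128]) torch.Size([2, 128])
```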
Related papers
- MM-TTS: A Unified Framework for Multimodal, Prompt-Induced Emotional Text-to-Speech Synthesis [70.06396781553191]
Multimodal Emotional Text-to-Speech System (MM-TTS) is a unified framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
MM-TTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align), which employs contrastive learning to align emotional features across text, audio, and visual modalities, and the Emotion Embedding-Induced TTS (EMI-TTS), which integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z) - Attention-based Interactive Disentangling Network for Instance-level
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech
Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z) - Fine-grained Emotional Control of Text-To-Speech: Learning To Rank
Inter- And Intra-Class Emotion Intensities [1.4986031916712106]
State-of-the-art Text-To-Speech (TTS) models are capable of producing high-quality speech.
We propose a fine-grained controllable emotional TTS model that considers both inter- and intra-class distances.
Our experiments demonstrate that our model outperforms two state-of-the-art controllable TTS models in controllability, emotion, and naturalness.
arXiv Detail & Related papers (2023-03-02T09:09:03Z) - Time out of Mind: Generating Rate of Speech conditioned on emotion and
speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text.
These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech.
We achieve better performance on objective measures for neutral speech, and better time alignment for happy speech, compared to an out-of-the-box model.
arXiv Detail & Related papers (2023-01-29T02:58:01Z) - A Study of Modeling Rising Intonation in Cantonese Neural Speech
Synthesis [10.747119651974947]
Declarative questions are commonly used in daily Cantonese conversations.
Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences.
We propose to complement the Cantonese TTS model with a BERT-based statement/question classifier.
arXiv Detail & Related papers (2022-08-03T16:21:08Z) - Textless Speech Emotion Conversion using Decomposed and Discrete
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z) - Emotional Prosody Control for Speech Generation [7.66200737962746]
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system in which a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of the dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)