A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
- URL: http://arxiv.org/abs/2208.02189v1
- Date: Wed, 3 Aug 2022 16:21:08 GMT
- Title: A Study of Modeling Rising Intonation in Cantonese Neural Speech Synthesis
- Authors: Qibing Bai, Tom Ko, Yu Zhang
- Abstract summary: Declarative questions are commonly used in daily Cantonese conversations.
Vanilla neural text-to-speech (TTS) systems are not capable of synthesizing rising intonation for these sentences.
We propose to complement the Cantonese TTS model with a BERT-based statement/question classifier.
- Score: 10.747119651974947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In human speech, a speaker's attitude cannot be fully expressed by
the textual content alone; it must be conveyed together with the intonation.
Declarative questions are commonly used in daily Cantonese conversations, and
they are usually uttered with rising intonation. Vanilla neural text-to-speech
(TTS) systems cannot synthesize rising intonation for these sentences because
the relevant semantic information is lost. Although it has become more common
to complement such systems with extra language models, their performance in
modeling rising intonation has not been well studied. In this paper, we propose
to complement the Cantonese TTS model with a BERT-based statement/question
classifier. We design different training strategies and compare their
performance. We conduct our experiments on a Cantonese corpus named CanTTS.
Empirical results show that the separate training approach obtains the best
generalization performance and feasibility.
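As a concrete illustration of the proposed setup, here is a minimal sketch of a BERT-based statement/question classifier whose output could be fed to a TTS model as an extra rising-intonation flag. The checkpoint name, the label convention, and the gating logic are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of a BERT-based statement/question classifier that could
# gate rising intonation in a TTS front end. The checkpoint name, the
# label convention (0 = statement, 1 = question), and the gating logic
# are illustrative assumptions, not the paper's released code.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-chinese"  # hypothetical; a Cantonese BERT would fit better

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
# The classification head is randomly initialized here; it would need
# fine-tuning on labeled statement/question sentences (e.g., from CanTTS).
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def is_declarative_question(text: str) -> bool:
    """Classify a sentence; True means 'question', so the TTS model
    should apply a rising sentence-final intonation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(dim=-1).item() == 1
```

The TTS model would then consume this binary flag as an extra input feature and raise the sentence-final F0 contour when it is set.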
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
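As a schematic (not TransVIP's actual code), the two separate encoders can be pictured as follows; the feature inputs, shapes, and pooling choices are all assumptions.

```python
# Schematic sketch of the two-encoder idea: one encoder summarizes speaker
# timbre, the other summarizes timing/pause structure (isochrony), and both
# condition the translation decoder. Shapes are illustrative assumptions.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Pools source-speech features into a fixed speaker embedding."""
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, feats):                  # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)
        return h[-1]                           # (B, emb_dim)

class IsochronyEncoder(nn.Module):
    """Summarizes a frame-level speech/pause mask to preserve timing."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(1, emb_dim)

    def forward(self, pause_mask):             # (B, T, 1), 1 = speech, 0 = pause
        return self.proj(pause_mask).mean(dim=1)

# A downstream decoder would consume both embeddings so the translated
# speech keeps the source voice and its pause placement.
```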
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that comprehensively models both the syntactic and acoustic cues associated with pausing patterns.
Notably, our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences.
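A hedged sketch of the general recipe: classify each word boundary as pause/no-pause from syntactic features (e.g., POS, parse depth) joined with acoustic context features. The dimensions and the classifier below are placeholders, not the paper's model.

```python
# Placeholder sketch of a boundary-level pause classifier over joined
# syntactic and acoustic features; all dimensions are assumptions.
import torch
import torch.nn as nn

class PausePredictor(nn.Module):
    def __init__(self, syn_dim=16, ac_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(syn_dim + ac_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),              # 0 = no pause, 1 = pause
        )

    def forward(self, syn_feats, ac_feats):    # (B, N, syn_dim), (B, N, ac_dim)
        return self.mlp(torch.cat([syn_feats, ac_feats], dim=-1))
```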
arXiv Detail & Related papers (2024-04-03T09:17:38Z)
- Syllable based DNN-HMM Cantonese Speech to Text System [3.976127530758402]
This paper builds a Cantonese Speech-to-Text (STT) system with a syllable-based acoustic model.
The ONC-based (onset-nucleus-coda) syllable acoustic modeling achieves the best performance, with a word error rate (WER) of 9.66% and a real-time factor (RTF) of 1.38812.
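For intuition, here is a simplified sketch of onset-nucleus-coda (ONC) decomposition for a Jyutping-romanized Cantonese syllable; the onset/coda inventories are abbreviated assumptions, not the paper's full phone set.

```python
# Simplified sketch of ONC decomposition for a toneless Jyutping syllable.
# The inventories below are abbreviated assumptions.
ONSETS = ["gw", "kw", "ng", "b", "p", "m", "f", "d", "t", "n", "l",
          "g", "k", "h", "z", "c", "s", "j", "w"]
CODAS = ["ng", "m", "n", "p", "t", "k", "i", "u"]

def split_onc(syllable: str):
    """Split a Jyutping syllable into (onset, nucleus, coda)."""
    onset = next((o for o in ONSETS if syllable.startswith(o)), "")
    rest = syllable[len(onset):]
    coda = next((c for c in CODAS if rest.endswith(c) and len(rest) > len(c)), "")
    nucleus = rest[: len(rest) - len(coda)]
    return onset, nucleus, coda

print(split_onc("soeng"))  # ('s', 'oe', 'ng')
print(split_onc("gwok"))   # ('gw', 'o', 'k')
```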
arXiv Detail & Related papers (2024-02-13T20:54:24Z)
- SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges when learning sentence-level representations of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
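An ABX-style discrimination test can be sketched as follows: a representation scores well when X lies closer to the matched item A than to the distractor B. The cosine metric and the pairing scheme are assumptions; the paper's Spoken Speech ABX task may differ in detail.

```python
# Minimal sketch of an ABX-style discrimination test over sentence-level
# speech embeddings. Metric and pairing are assumptions.
import numpy as np

def abx_accuracy(a_embs, b_embs, x_embs):
    """a_embs[i] matches x_embs[i]; b_embs[i] is the distractor."""
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    correct = sum(cos(x, a) > cos(x, b)
                  for a, b, x in zip(a_embs, b_embs, x_embs))
    return correct / len(x_embs)

# With random embeddings, accuracy sits near chance (0.5):
rng = np.random.default_rng(0)
A, B, X = (rng.normal(size=(100, 64)) for _ in range(3))
print(abx_accuracy(A, B, X))
```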
arXiv Detail & Related papers (2023-10-16T20:05:36Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- QI-TTS: Questioning Intonation Control for Emotional Speech Synthesis [29.962519978925236]
We propose QI-TTS, which aims to better transfer and control intonation so as to convey the speaker's questioning intention.
We propose a multi-style extractor to extract style embeddings at two different levels.
Experiments have validated the effectiveness of QI-TTS for improving intonation in emotional speech.
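A hedged sketch of a two-level style extractor: one embedding pooled over the whole utterance (global style) and one over the sentence-final frames, where questioning intonation concentrates. The granularities, shapes, and fixed final-region length are assumptions, not QI-TTS's actual design.

```python
# Hedged sketch of a two-level style extractor; shapes and the fixed
# final-region length are illustrative assumptions.
import torch
import torch.nn as nn

class TwoLevelStyleExtractor(nn.Module):
    def __init__(self, feat_dim=80, style_dim=128):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, mels, final_len=20):      # mels: (B, T, feat_dim)
        frames, _ = self.encoder(mels)          # (B, T, style_dim)
        global_style = frames.mean(dim=1)       # utterance-level embedding
        local_style = frames[:, -final_len:].mean(dim=1)  # final region
        return global_style, local_style
```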
arXiv Detail & Related papers (2023-03-14T07:53:19Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech (TTS) synthesis.
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
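The "discrete codes from a neural audio codec" step can be sketched with the open-source EnCodec model as a stand-in; the codec choice, bandwidth setting, and input file are assumptions, not VALL-E's exact configuration.

```python
# Sketch of extracting discrete codec codes with EnCodec (an assumption;
# VALL-E's exact codec setup may differ).
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)             # 8 codebooks at this setting

wav, sr = torchaudio.load("utterance.wav")  # hypothetical input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels)

with torch.no_grad():
    frames = model.encode(wav.unsqueeze(0))
codes = torch.cat([code for code, _ in frames], dim=-1)  # (1, n_codebooks, T)
print(codes.shape)
# A decoder-only language model is then trained to continue such code
# sequences from text (and a short acoustic prompt for the target voice).
```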
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system that leverages self-supervised discrete speech units as the linguistic representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
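A common way to obtain such self-supervised discrete units is sketched below, assuming HuBERT features quantized by k-means; the layer choice and cluster count are illustrative, and k-means is normally fit on a corpus rather than a single utterance.

```python
# Hedged sketch: extract HuBERT features and quantize them with k-means
# to obtain a discrete unit sequence. Layer and cluster count are
# illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import KMeans

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

wav, sr = torchaudio.load("utterance.wav")  # hypothetical mono input
wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)

with torch.no_grad():
    feats, _ = hubert.extract_features(wav) # one tensor per transformer layer
layer6 = feats[5].squeeze(0).numpy()        # (T, 768), a common layer choice

kmeans = KMeans(n_clusters=100, n_init=10).fit(layer6)  # toy fit, see note above
units = kmeans.predict(layer6)              # the discrete unit sequence
```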
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation [6.090922774386845]
We propose a novel Chinese dialect TTS frontend with a translation module.
The module helps convert Mandarin text into idiomatic dialect expressions with correct orthography and grammar.
It is the first known work to incorporate translation into TTS.
arXiv Detail & Related papers (2022-06-10T07:46:34Z)
- Into-TTS: Intonation Template based Prosody Control System [17.68906373821669]
Intonation plays an important role in conveying the speaker's intention.
Current end-to-end TTS systems often fail to model proper intonation.
We propose a novel, intuitive method to synthesize speech in different intonations.
arXiv Detail & Related papers (2022-04-04T06:37:19Z)
- Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
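The multi-task scheme can be sketched as a joint objective: the usual spectrogram loss plus a phrase-break classification loss on the encoder states. The head design and the loss weight below are illustrative assumptions.

```python
# Minimal sketch of the multi-task idea: a phrase-break classification
# head trained jointly with the spectrogram loss. Head design and the
# weight lam are assumptions.
import torch
import torch.nn as nn

break_head = nn.Linear(512, 2)   # assumed encoder dim 512 -> {no-break, break}
ce = nn.CrossEntropyLoss()
l1 = nn.L1Loss()

def multitask_loss(mel_pred, mel_target, enc_states, break_labels, lam=0.1):
    """Joint loss: mel reconstruction + phrase-break prediction."""
    tts_loss = l1(mel_pred, mel_target)
    logits = break_head(enc_states)                         # (B, N, 2)
    break_loss = ce(logits.transpose(1, 2), break_labels)   # labels: (B, N)
    return tts_loss + lam * break_loss
```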
arXiv Detail & Related papers (2020-08-11T07:57:29Z)