A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis
- URL: http://arxiv.org/abs/2309.11849v1
- Date: Thu, 21 Sep 2023 07:45:44 GMT
- Title: A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis
- Authors: Xianhao Wei, Jia Jia, Xiang Li, Zhiyong Wu, Ziyi Wang
- Abstract summary: This paper explores predicting suitable prosodic features for fine-grained emotion analysis from the discourse-level text.
We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features.
- Score: 19.271542595753267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper explores predicting suitable prosodic features for fine-grained
emotion analysis from the discourse-level text. To obtain fine-grained
emotional prosodic features as predictive values for our model, we extract a
phoneme-level Local Prosody Embedding sequence (LPEs) and a Global Style
Embedding as prosodic speech features from the speech with the help of a style
transfer model. We propose a Discourse-level Multi-scale text Prosodic Model
(D-MPM) that exploits multi-scale text to predict these two prosodic features.
The proposed model can be used to analyze better emotional prosodic features
and thus guide the speech synthesis model to synthesize more expressive speech.
To quantitatively evaluate the proposed model, we contribute a new,
large-scale Discourse-level Chinese Audiobook (DCA) dataset with more than
13,000 annotated utterances.
Experimental results on the DCA dataset show that the multi-scale text
information effectively helps to predict prosodic features, and the
discourse-level text improves both the overall coherence and the user
experience. More interestingly, although we target the synthesis effect of
the style transfer model, the speech synthesized by the proposed text
prosodic analysis model even surpasses style transfer from the original
speech on some user evaluation indicators.
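To make the prediction task concrete, below is a minimal PyTorch sketch of a
multi-scale text encoder that emits a phoneme-level LPE sequence plus one
Global Style Embedding per utterance. This is not the authors' implementation:
the module layout, the dimensions, and the concatenative fusion of phoneme-
and word-scale features are assumptions for illustration only.

```python
# Minimal sketch in the spirit of D-MPM; all names, dimensions, and the
# fusion strategy are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class MultiScaleProsodyPredictor(nn.Module):
    def __init__(self, vocab=5000, dim=256, lpe_dim=64, style_dim=128):
        super().__init__()
        self.phone_emb = nn.Embedding(vocab, dim)   # phoneme-scale tokens
        self.word_emb = nn.Embedding(vocab, dim)    # word-scale tokens, pre-aligned to phonemes
        self.encoder = nn.GRU(2 * dim, dim, batch_first=True, bidirectional=True)
        self.lpe_head = nn.Linear(2 * dim, lpe_dim)      # per-phoneme Local Prosody Embedding
        self.style_head = nn.Linear(2 * dim, style_dim)  # one Global Style Embedding per utterance

    def forward(self, phone_ids, word_ids_per_phone):
        # Fuse phoneme-level and phoneme-aligned word-level text features.
        x = torch.cat([self.phone_emb(phone_ids),
                       self.word_emb(word_ids_per_phone)], dim=-1)
        h, _ = self.encoder(x)                     # (B, T, 2*dim)
        lpe_seq = self.lpe_head(h)                 # (B, T, lpe_dim)
        style = self.style_head(h.mean(dim=1))     # (B, style_dim), mean-pooled
        return lpe_seq, style

# Training would regress both outputs onto the LPEs and Global Style Embedding
# extracted from reference speech by the style transfer model (e.g. L1/MSE).
model = MultiScaleProsodyPredictor()
phones = torch.randint(0, 5000, (2, 40))
words = torch.randint(0, 5000, (2, 40))
lpe, style = model(phones, words)
print(lpe.shape, style.shape)  # torch.Size([2, 40, 64]) torch.Size([2, 128])
```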
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis [37.65745551401636]
Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesized speech in a target speaker's timbre.
In most previous methods, the synthesized fine-grained prosody features often represent the source speaker's average style.
A strength-controlled semi-supervised style extractor is proposed to disentangle the style from content and timbre.
arXiv Detail & Related papers (2023-03-14T08:52:58Z)
- Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis [3.6159128762538018]
We show that normalizing flow based prior networks can result in more expressive speech at the cost of a slight drop in quality.
We also propose a Dynamical VAE model, that can generate higher quality speech although with decreased expressiveness and variability compared to the flow based models.
arXiv Detail & Related papers (2022-11-02T17:45:01Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis [18.812696623555855]
We present a novel few shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few shot manner.
We demonstrate how the affine parameters of normalization help in capturing the prosodic features such as energy and fundamental frequency.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features (see the sketch after this list).
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit [39.258370942013165]
A prosody learning mechanism is proposed to model the prosody of speech within a TTS system.
A novel self-attention structure, named local attention, is proposed to lift the restriction on input text length.
Experiments on English and Mandarin show that our model produces speech with more satisfactory prosody.
arXiv Detail & Related papers (2020-08-13T02:54:50Z)
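As an illustration of the hierarchical conditioning summarized in the
Hierarchical Prosody Modeling entry above, the sketch below predicts
word-level prosody first and feeds it, broadcast to phonemes, into the
phoneme-level predictor. Module names, feature sizes, and the
word-to-phoneme mapping are assumptions, not that paper's code.

```python
# Generic sketch: phoneme-level prosody prediction conditioned on word-level
# prosody. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalProsody(nn.Module):
    def __init__(self, dim=256, word_pros=8, phone_pros=4):
        super().__init__()
        self.word_head = nn.Linear(dim, word_pros)
        self.phone_head = nn.Linear(dim + word_pros, phone_pros)

    def forward(self, word_feats, phone_feats, word_of_phone):
        # word_feats: (B, T_w, dim) text features per word
        # phone_feats: (B, T_p, dim) text features per phoneme
        # word_of_phone: (B, T_p) index of the word each phoneme belongs to
        word_pros = self.word_head(word_feats)               # (B, T_w, word_pros)
        idx = word_of_phone.unsqueeze(-1).expand(-1, -1, word_pros.size(-1))
        per_phone = word_pros.gather(1, idx)                 # broadcast words to phonemes
        phone_pros = self.phone_head(
            torch.cat([phone_feats, per_phone], dim=-1))     # condition on word prosody
        return word_pros, phone_pros

m = HierarchicalProsody()
word_feats = torch.randn(2, 5, 256)
phone_feats = torch.randn(2, 12, 256)
word_of_phone = torch.randint(0, 5, (2, 12))
wp, pp = m(word_feats, phone_feats, word_of_phone)
print(wp.shape, pp.shape)  # torch.Size([2, 5, 8]) torch.Size([2, 12, 4])
```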