Prosodic Representation Learning and Contextual Sampling for Neural
Text-to-Speech
- URL: http://arxiv.org/abs/2011.02252v1
- Date: Wed, 4 Nov 2020 12:20:21 GMT
- Title: Prosodic Representation Learning and Contextual Sampling for Neural
Text-to-Speech
- Authors: Sri Karlapati, Ammar Abbas, Zack Hodari, Alexis Moinet, Arnaud Joly,
Penny Karanasou, Thomas Drugman
- Abstract summary: We introduce Kathaka, a model trained with a novel two-stage training process for neural speech synthesis.
We learn a prosodic distribution at the sentence level from mel-spectrograms available during training.
In Stage II, we propose a novel method to sample from this learnt prosodic distribution using the contextual information available in text.
- Score: 16.45773135100367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce Kathaka, a model trained with a novel two-stage
training process for neural speech synthesis with contextually appropriate
prosody. In Stage I, we learn a prosodic distribution at the sentence level
from mel-spectrograms available during training. In Stage II, we propose a
novel method to sample from this learnt prosodic distribution using the
contextual information available in text. To do this, we use BERT on text, and
graph-attention networks on parse trees extracted from text. We show a
statistically significant relative improvement of $13.2\%$ in naturalness over
a strong baseline when compared to recordings. We also conduct an ablation
study on variations of our sampling technique, and show a statistically
significant improvement over the baseline in each case.
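The abstract describes the architecture only at a high level. The snippet below is a minimal, hedged PyTorch illustration of that two-stage idea, not the authors' implementation: a Stage I VAE-style encoder that summarises a mel-spectrogram into a sentence-level Gaussian over a prosodic latent, and a Stage II sampler that predicts that latent from a BERT sentence embedding combined with attention over parse-tree node features. All class names, dimensions, and the use of masked multi-head attention as a stand-in for a graph-attention network are assumptions.

```python
# Minimal sketch of the two-stage idea (assumed shapes and names, not the
# Kathaka implementation).
import torch
import torch.nn as nn


class SentenceProsodyEncoder(nn.Module):
    """Stage I (sketch): map a mel-spectrogram to a sentence-level Gaussian
    over a prosodic latent, VAE-style."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, latent: int = 64):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, mel: torch.Tensor):           # mel: (B, T, n_mels)
        _, h = self.rnn(mel)                        # h: (1, B, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterise
        return z, mu, logvar


class ContextualSampler(nn.Module):
    """Stage II (sketch): predict the prosodic latent from text context by
    combining a BERT sentence embedding with attention over parse-tree nodes,
    restricted to tree edges (a stand-in for a graph-attention network)."""

    def __init__(self, bert_dim: int = 768, node_dim: int = 128,
                 latent: int = 64, heads: int = 4):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, bert_dim)
        self.attn = nn.MultiheadAttention(bert_dim, heads, batch_first=True)
        self.to_latent = nn.Linear(2 * bert_dim, latent)

    def forward(self, bert_sent, node_feats, adj):
        # bert_sent: (B, bert_dim); node_feats: (B, N, node_dim)
        # adj: (B, N, N) bool adjacency of the parse tree, self-loops included
        # so every node can attend to at least itself.
        x = self.node_proj(node_feats)
        mask = (~adj).repeat_interleave(self.attn.num_heads, dim=0)
        x, _ = self.attn(x, x, x, attn_mask=mask)   # attend only along tree edges
        tree_summary = x.mean(dim=1)                # pool node features
        return self.to_latent(torch.cat([bert_sent, tree_summary], dim=-1))
```

In this reading, the Stage II sampler would be trained to match the Stage I sentence-level distribution; at synthesis time only the text-driven sampler is needed, so the predicted latent conditions the decoder in place of a sample drawn from a reference mel-spectrogram.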
Related papers
- Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering [118.53208190209517]
We propose a framework to learn the proper textual descriptions for diffusion models through prompt learning.
Our method can effectively learn the prompts to improve the matches between the input text and the generated images.
arXiv Detail & Related papers (2024-01-12T03:46:29Z)
- Adversarial Capsule Networks for Romanian Satire Detection and Sentiment Analysis [0.13048920509133807]
Satire detection and sentiment analysis are intensively explored natural language processing tasks.
In languages with fewer research resources, an alternative is to produce artificial examples based on character-level adversarial processes.
In this work, we improve well-known NLP models with adversarial training and capsule networks.
The proposed framework outperforms existing methods on the two tasks, achieving up to 99.08% accuracy.
arXiv Detail & Related papers (2023-06-13T15:23:44Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Prompt-based Learning for Text Readability Assessment [0.4757470449749875]
We propose a novel adaptation of a pre-trained seq2seq model for readability assessment.
We prove that a seq2seq model can be adapted to discern which of two given texts is more difficult (pairwise).
arXiv Detail & Related papers (2023-02-25T18:39:59Z)
- NapSS: Paragraph-level Medical Text Simplification via Narrative Prompting and Sentence-matching Summarization [46.772517928718216]
We propose a summarize-then-simplify two-stage strategy, which we call NapSS.
NapSS identifies the relevant content to simplify while ensuring that the original narrative flow is preserved.
Our model performs significantly better than the seq2seq baseline on an English medical corpus.
arXiv Detail & Related papers (2023-02-11T02:20:25Z)
- An Explanation of In-context Learning as Implicit Bayesian Inference [117.19809377740188]
We study the role of the pretraining distribution in the emergence of in-context learning.
We prove that in-context learning occurs implicitly via Bayesian inference of the latent concept.
We empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
arXiv Detail & Related papers (2021-11-03T09:12:33Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information from written texts for speech sentiment analysis.
We propose a pseudo-label-based semi-supervised training strategy using a language model in an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- On Sampling-Based Training Criteria for Neural Language Modeling [97.35284042981675]
We consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation.
We show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities; a generic NCE sketch appears after this list.
Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim.
arXiv Detail & Related papers (2021-04-21T12:55:52Z)
- Neural Data-to-Text Generation with LM-based Text Augmentation [27.822282190362856]
We show that a weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% of the annotations.
By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points.
arXiv Detail & Related papers (2021-02-06T10:21:48Z)
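Referenced from the sampling-criteria entry above: a generic, hedged sketch of a noise contrastive estimation loss for a language-model output layer, written from the textbook formulation rather than from that paper. The log(k * q(w)) correction term illustrates the kind of adjustment toward the intended class posterior that the summary mentions; all names and shapes are assumptions.

```python
# Generic NCE loss sketch for a neural LM output layer (textbook formulation,
# assumed shapes; not taken from the cited paper).
import math
import torch
import torch.nn.functional as F


def nce_loss(s_true, s_noise, logq_true, logq_noise, k: int):
    """s_true: (B,) model scores of the target words; s_noise: (B, k) scores of
    k noise words per target; logq_*: log-probabilities of the same words under
    the noise distribution q."""
    # Classify "data vs. noise" with logits corrected by log(k * q(w)),
    # which ties the binary criterion to the true class posterior.
    logit_true = s_true - (math.log(k) + logq_true)
    logit_noise = s_noise - (math.log(k) + logq_noise)
    loss_true = F.binary_cross_entropy_with_logits(
        logit_true, torch.ones_like(logit_true))      # -log sigma(.), batch mean
    loss_noise = F.binary_cross_entropy_with_logits(
        logit_noise, torch.zeros_like(logit_noise))   # -log(1 - sigma(.)), mean over B*k
    return loss_true + k * loss_noise                 # per-example sum over k noise words
```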