Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis
- URL: http://arxiv.org/abs/2002.03785v1
- Date: Thu, 6 Feb 2020 12:52:03 GMT
- Title: Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis
- Authors: Guangzhi Sun, Yu Zhang, Ron J. Weiss, Yuan Cao, Heiga Zen, Yonghui Wu
- Abstract summary: This paper proposes a hierarchical, fine-grained and interpretable latent variable model for prosody based on the Tacotron 2 text-to-speech model.
It achieves multi-resolution modeling of prosody by conditioning finer level representations on coarser level ones.
- Score: 42.29094097639594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a hierarchical, fine-grained and interpretable latent
variable model for prosody based on the Tacotron 2 text-to-speech model. It
achieves multi-resolution modeling of prosody by conditioning finer level
representations on coarser level ones. Additionally, it imposes hierarchical
conditioning across all latent dimensions using a conditional variational
auto-encoder (VAE) with an auto-regressive structure. Evaluation of
reconstruction performance illustrates that the new structure does not degrade
the model while allowing better interpretability. Interpretations of prosody
attributes are provided together with the comparison between word-level and
phone-level prosody representations. Moreover, both qualitative and
quantitative evaluations are used to demonstrate the improvement in the
disentanglement of the latent dimensions.
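To make the conditioning scheme concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: a coarse (word-level) latent that conditions a fine (phone-level) latent, and latent dimensions that are sampled one at a time so each dimension is conditioned on those sampled before it. All module names, layer sizes, and the alignment input phone_to_word are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch of hierarchical, autoregressive prosody latents (illustrative only;
# shapes, sizes, and module structure are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class AutoregressiveLatent(nn.Module):
    """Conditional VAE posterior that samples one latent dimension at a time,
    so dimension k is conditioned on the k dimensions sampled before it."""

    def __init__(self, cond_dim: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.nets = nn.ModuleList(
            [nn.Sequential(nn.Linear(cond_dim + k, hidden), nn.Tanh(), nn.Linear(hidden, 2))
             for k in range(latent_dim)]
        )

    def forward(self, cond: torch.Tensor):
        z, kl = [], 0.0
        for net in self.nets:
            inp = torch.cat([cond] + z, dim=-1)
            mu, logvar = net(inp).chunk(2, dim=-1)
            z.append(mu + torch.randn_like(mu) * (0.5 * logvar).exp())  # reparameterization
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1)
        return torch.cat(z, dim=-1), kl


class HierarchicalProsodyEncoder(nn.Module):
    """Fine (phone-level) latents are conditioned on coarse (word-level) latents."""

    def __init__(self, feat_dim: int = 80, word_z: int = 3, phone_z: int = 3):
        super().__init__()
        self.word_latent = AutoregressiveLatent(feat_dim, word_z)
        self.phone_latent = AutoregressiveLatent(feat_dim + word_z, phone_z)

    def forward(self, word_feats, phone_feats, phone_to_word):
        # word_feats: [W, feat_dim] reference prosody features pooled per word
        # phone_feats: [P, feat_dim] reference prosody features pooled per phone
        # phone_to_word: [P] index of the word each phone belongs to (assumed given)
        z_word, kl_word = self.word_latent(word_feats)
        cond = torch.cat([phone_feats, z_word[phone_to_word]], dim=-1)
        z_phone, kl_phone = self.phone_latent(cond)
        return z_word, z_phone, kl_word.sum() + kl_phone.sum()


# Tiny usage example with random features for a 4-word, 12-phone utterance.
enc = HierarchicalProsodyEncoder()
z_w, z_p, kl = enc(torch.randn(4, 80), torch.randn(12, 80), torch.randint(0, 4, (12,)))
```
In the full model, the phone-level latents would be combined with the text encodings that condition the Tacotron 2 decoder; that wiring is omitted here.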
Related papers
- How much do contextualized representations encode long-range context? [10.188367784207049]
We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens.
Our methodology employs a perturbation setup and the Anisotropy-Calibrated Cosine Similarity metric to capture the degree of contextualization of long-range patterns from the perspective of representation geometry.
arXiv Detail & Related papers (2024-10-16T06:49:54Z) - LLM-based Hierarchical Concept Decomposition for Interpretable Fine-Grained Image Classification [5.8754760054410955]
We introduce Hi-CoDecomposition, a novel framework designed to enhance model interpretability through structured concept analysis.
Our approach not only aligns with the performance of state-of-the-art models but also advances transparency by providing clear insights into the decision-making process.
arXiv Detail & Related papers (2024-05-29T00:36:56Z) - LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Large language models (LLMs) based on the transformer architecture are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer.
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
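As a loose illustration of what "rear layers collapsing into a prior layer" could look like in code, the sketch below merges a group of consecutive layers by folding each rear layer's parameter deviation from the group's first layer into that first layer. The merge rule and the grouping are assumptions for illustration, not necessarily LaCo's exact procedure.
```python
# Hedged sketch: collapse layers[start:end] of a transformer stack into one layer by
# accumulating each rear layer's parameter difference from the group's first layer.
# (The difference-accumulation rule and the grouping are illustrative assumptions.)
from copy import deepcopy

import torch
import torch.nn as nn


def collapse_layers(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Return a shorter stack in which layers[start:end] are merged into one layer."""
    merged = deepcopy(layers[start])
    merged_params = dict(merged.named_parameters())
    reference = {name: p.data.clone() for name, p in layers[start].named_parameters()}
    with torch.no_grad():
        for rear in list(layers)[start + 1:end]:
            for name, p in rear.named_parameters():
                # Fold the rear layer's deviation from the reference layer into the merged layer.
                merged_params[name].add_(p.data - reference[name])
    return nn.ModuleList(list(layers)[:start] + [merged] + list(layers)[end:])
```
A practical pruner would accept such a collapse only if the compressed model's behavior stays close to the original's on some calibration data; that acceptance test is omitted here.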
- Model Criticism for Long-Form Text Generation [113.13900836015122]
We apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text.
We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality.
We find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
arXiv Detail & Related papers (2022-10-16T04:35:58Z) - Learning Disentangled Representations for Natural Language Definitions [0.0]
We argue that recurrent syntactic and semantic regularities in textual data can be used to provide the models with both structural biases and generative factors.
We leverage the semantic structures present in a representative and semantically dense category of sentence types, definitional sentences, for training a Variational Autoencoder to learn disentangled representations.
arXiv Detail & Related papers (2022-09-22T14:31:55Z) - A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document in which the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units [19.668440671541546]
In end-to-end automatic speech recognition, a model is expected to implicitly learn representations suitable for recognizing a word-level sequence.
We propose a hierarchical conditional model that is based on connectionist temporal classification (CTC).
Experimental results on LibriSpeech-100h, 960h and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model.
arXiv Detail & Related papers (2021-10-08T13:15:58Z)
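The summary above leaves the conditioning mechanism implicit; one plausible reading, sketched below with assumed layer and vocabulary sizes, is an intermediate CTC head on finer subword units whose posteriors are fed back to condition the upper encoder blocks before a final CTC head on coarser units. This is an illustrative sketch, not the paper's exact architecture.
```python
# Hedged sketch of hierarchical conditional CTC: a lower encoder stack is supervised
# with CTC on fine subword units, its posteriors condition the upper stack, and the
# upper stack is supervised with CTC on coarser units. All sizes are assumptions.
import torch
import torch.nn as nn


class HierarchicalConditionalCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, fine_vocab=300, coarse_vocab=1000, depth=4):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.proj_in = nn.Linear(feat_dim, hidden)
        self.lower = nn.TransformerEncoder(block, num_layers=depth)
        self.upper = nn.TransformerEncoder(block, num_layers=depth)
        self.fine_head = nn.Linear(hidden, fine_vocab)      # intermediate CTC over fine units
        self.cond_proj = nn.Linear(fine_vocab, hidden)      # feeds fine posteriors back in
        self.coarse_head = nn.Linear(hidden, coarse_vocab)  # final CTC over coarse units
        self.ctc = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, feats, feat_lens, fine_tgt, fine_lens, coarse_tgt, coarse_lens):
        h = self.lower(self.proj_in(feats))                              # [B, T, hidden]
        fine_logp = self.fine_head(h).log_softmax(-1)
        # Condition the upper blocks on the intermediate fine-unit predictions.
        h = self.upper(h + self.cond_proj(fine_logp.exp()))
        coarse_logp = self.coarse_head(h).log_softmax(-1)
        loss_fine = self.ctc(fine_logp.transpose(0, 1), fine_tgt, feat_lens, fine_lens)
        loss_coarse = self.ctc(coarse_logp.transpose(0, 1), coarse_tgt, feat_lens, coarse_lens)
        return loss_fine + loss_coarse
```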
- Evaluating the Impact of a Hierarchical Discourse Representation on Entity Coreference Resolution Performance [3.7277082975620797]
In this work, we leverage automatically constructed discourse parse trees within a neural approach.
We demonstrate a significant improvement on two benchmark entity coreference-resolution datasets.
arXiv Detail & Related papers (2021-04-20T19:14:57Z) - Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
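To illustrate this coarse-to-fine prediction idea in a non-autoregressive setting, here is a minimal sketch (assumed prosody feature set and layer sizes, not the paper's model) of a word-level prosody predictor whose output conditions a phoneme-level predictor:
```python
# Hedged sketch of coarse-to-fine prosody prediction: a word-level predictor produces
# per-word prosody features, and the phoneme-level predictor is conditioned on them.
# Feature sets, layer sizes, and the alignment input are illustrative assumptions.
import torch
import torch.nn as nn


class HierarchicalProsodyPredictor(nn.Module):
    def __init__(self, hidden=256, n_prosody=3):
        super().__init__()
        self.word_predictor = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, n_prosody))
        self.phone_predictor = nn.Sequential(
            nn.Linear(hidden + n_prosody, hidden), nn.ReLU(), nn.Linear(hidden, n_prosody))

    def forward(self, word_enc, phone_enc, phone_to_word):
        # word_enc: [W, hidden], phone_enc: [P, hidden],
        # phone_to_word: [P] index of the word containing each phoneme.
        word_prosody = self.word_predictor(word_enc)                    # coarse prediction
        cond = torch.cat([phone_enc, word_prosody[phone_to_word]], dim=-1)
        phone_prosody = self.phone_predictor(cond)                      # fine prediction
        return word_prosody, phone_prosody


# Example: 3 words, 10 phonemes, 256-dim text encodings.
pred = HierarchicalProsodyPredictor()
w, p = pred(torch.randn(3, 256), torch.randn(10, 256), torch.randint(0, 3, (10,)))
```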