Syntactic representation learning for neural network based TTS with
syntactic parse tree traversal
- URL: http://arxiv.org/abs/2012.06971v1
- Date: Sun, 13 Dec 2020 05:52:07 GMT
- Title: Syntactic representation learning for neural network based TTS with
syntactic parse tree traversal
- Authors: Changhe Song, Jingbei Li, Yixuan Zhou, Zhiyong Wu, Helen Meng
- Abstract summary: We propose a syntactic representation learning method based on syntactic parse tree traversal to automatically utilize syntactic structure information.
Experimental results demonstrate the effectiveness of our proposed approach.
For sentences with multiple syntactic parse trees, prosodic differences can be clearly perceived in the synthesized speech.
- Score: 49.05471750563229
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The syntactic structure of a sentence is correlated with the prosodic
structure of speech, which is crucial for improving the prosody and naturalness
of a text-to-speech (TTS) system. Nowadays, TTS systems usually incorporate
syntactic structure information through manually designed features based on
expert knowledge. In this paper, we propose a syntactic representation learning
method based on syntactic parse tree traversal to automatically utilize the
syntactic structure information. Two constituent label sequences are linearized
through left-first and right-first traversals of the constituent parse tree.
Syntactic representations are then extracted at the word level from each
constituent label sequence by a corresponding uni-directional gated recurrent
unit (GRU) network. Meanwhile, a nuclear-norm maximization loss is introduced
to enhance the discriminability and diversity of the constituent label
embeddings. Upsampled syntactic representations and phoneme embeddings are
concatenated to serve as the encoder input of Tacotron2. Experimental results
demonstrate the effectiveness of the proposed approach, with the mean opinion
score (MOS) increasing from 3.70 to 3.82 and ABX preference exceeding the
baseline by 17%. In addition, for sentences with multiple syntactic parse
trees, prosodic differences can be clearly perceived in the synthesized speech.
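As a rough illustration of the pipeline described in the abstract, the following minimal PyTorch/NLTK sketch shows one plausible reading of the two traversal-based constituent label sequences, the per-sequence uni-directional GRUs, and the nuclear-norm maximization term. It is an assumption-laden sketch, not the authors' implementation: the names (traverse, SyntacticEncoder, nuclear_norm_loss, the `<w>` word placeholder) are hypothetical, and the exact linearization, embedding sharing, and loss weighting in the paper may differ.

```python
# Minimal sketch of syntactic representation learning via parse tree traversal.
# Illustrative only; assumptions are noted inline, this is not the authors' code.
import torch
import torch.nn as nn
from nltk import Tree

WORD = "<w>"  # hypothetical placeholder marking a word (leaf) position

def traverse(tree, right_first=False):
    """Pre-order traversal emitting constituent labels; children are visited
    left-to-right (left-first) or right-to-left (right-first)."""
    if isinstance(tree, str):                    # leaf: an actual word
        return [WORD]
    labels = [tree.label()]
    children = reversed(tree) if right_first else tree
    for child in children:
        labels += traverse(child, right_first)
    return labels

class SyntacticEncoder(nn.Module):
    """Embeds a constituent-label sequence and runs a uni-directional GRU;
    hidden states at WORD positions act as word-level syntactic vectors."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.vocab = vocab                       # label -> index
        self.embed = nn.Embedding(len(vocab), dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def forward(self, labels):
        ids = torch.tensor([[self.vocab[l] for l in labels]])
        out, _ = self.gru(self.embed(ids))       # (1, T, dim)
        word_pos = [i for i, l in enumerate(labels) if l == WORD]
        return out[0, word_pos]                  # (num_words, dim)

def nuclear_norm_loss(embeddings):
    """Nuclear-norm maximization: maximize the sum of singular values of the
    label embedding matrix (returned negated, as a term to minimize)."""
    return -torch.linalg.matrix_norm(embeddings, ord="nuc")

# Example with a syntactically ambiguous sentence (one of its parses).
tree = Tree.fromstring(
    "(S (NP (NN police)) (VP (VBD shot) (NP (DT the) (NN man)) "
    "(PP (IN with) (NP (DT a) (NN gun)))))")
left_seq = traverse(tree)                        # left-first label sequence
right_seq = traverse(tree, right_first=True)     # right-first label sequence
vocab = {l: i for i, l in enumerate(sorted(set(left_seq + right_seq)))}
enc_l, enc_r = SyntacticEncoder(vocab), SyntacticEncoder(vocab)
word_vecs = torch.cat([enc_l(left_seq), enc_r(right_seq)], dim=-1)
aux_loss = nuclear_norm_loss(enc_l.embed.weight) + nuclear_norm_loss(enc_r.embed.weight)
```

In the paper, the resulting word-level syntactic representations are further upsampled to phoneme resolution and concatenated with phoneme embeddings before being fed to the Tacotron2 encoder; that step is omitted in this sketch.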
Related papers
- Syntactic Complexity Identification, Measurement, and Reduction Through
Controlled Syntactic Simplification [0.0]
We present a classical syntactic dependency-based approach to split and rephrase compound and complex sentences into a set of simplified sentences.
The paper also introduces an algorithm to identify and measure a sentence's syntactic complexity.
This work was accepted and presented at the International Workshop on Learning with Knowledge Graphs (IWLKG) at the WSDM-2023 conference.
arXiv Detail & Related papers (2023-04-16T13:13:58Z) - Syntactic Structure Processing in the Brain while Listening [3.735055636181383]
There are two popular syntactic parsing methods: constituency and dependency parsing.
Recent works have used syntactic embeddings based on constituency trees, incremental top-down parsing, and other word syntactic features for brain activity prediction given the text stimuli to study how the syntax structure is represented in the brain's language network.
We investigate the predictive power of the brain encoding models in three settings: (i) individual performance of the constituency and dependency syntactic parsing based embedding methods, (ii) efficacy of these syntactic parsing based embedding methods when controlling for basic syntactic signals, and (iii) relative effectiveness of each of the syntactic embedding methods when controlling for
arXiv Detail & Related papers (2023-02-16T21:28:11Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z) - Syntactic Perturbations Reveal Representational Correlates of
Hierarchical Phrase Structure in Pretrained Language Models [22.43510769150502]
It is not entirely clear what aspects of sentence-level syntax are captured by vector-based language representations.
We show that Transformers build sensitivity to larger parts of the sentence along their layers, and that hierarchical phrase structure plays a role in this process.
arXiv Detail & Related papers (2021-04-15T16:30:31Z) - Dependency Parsing based Semantic Representation Learning with Graph
Neural Network for Enhancing Expressiveness of Text-to-Speech [49.05471750563229]
We propose a semantic representation learning method based on a graph neural network, considering the dependency relations of a sentence.
We show that our proposed method outperforms the baseline using vanilla BERT features on both the LJSpeech and Blizzard Challenge 2013 datasets.
arXiv Detail & Related papers (2021-04-14T13:09:51Z) - GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech
Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech synthesis (TTS) is one such successful implementation.
We propose a novel neural TTS model, denoted as GraphSpeech, which is formulated under the graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
arXiv Detail & Related papers (2020-10-23T14:14:06Z) - Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from the syntactic LM.
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z) - Representations of Syntax [MASK] Useful: Effects of Constituency and
Dependency Structure in Recursive LSTMs [26.983602540576275]
Sequence-based neural networks show significant sensitivity to syntactic structure, but they still perform less well on syntactic tasks than tree-based networks.
We evaluate which of these two representational schemes more effectively introduces biases for syntactic structure.
We show that a constituency-based network generalizes more robustly than a dependency-based one, and that combining the two types of structure does not yield further improvement.
arXiv Detail & Related papers (2020-04-30T18:00:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.