RWEN-TTS: Relation-aware Word Encoding Network for Natural
Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2212.07939v1
- Date: Thu, 15 Dec 2022 16:17:03 GMT
- Title: RWEN-TTS: Relation-aware Word Encoding Network for Natural
Text-to-Speech Synthesis
- Authors: Shinhyeok Oh, HyeongRae Noh, Yoonseok Hong, Insoo Oh
- Abstract summary: A huge number of text-to-speech (TTS) models produce human-like speech.
Relation-aware Word Encoding Network (RWEN) effectively incorporates syntactic and semantic information based on two modules.
Experimental results show substantial improvements compared to previous works.
- Score: 3.591224588041813
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of deep learning, a huge number of text-to-speech (TTS)
models which produce human-like speech have emerged. Recently, by introducing
syntactic and semantic information w.r.t. the input text, various approaches
have been proposed to enrich the naturalness and expressiveness of TTS models.
Although these strategies showed impressive results, they still have some
limitations in utilizing language information. First, most approaches only use
graph networks to utilize syntactic and semantic information without
considering linguistic features. Second, most previous works do not explicitly
consider adjacent words when encoding syntactic and semantic information, even
though it is obvious that adjacent words are usually meaningful when encoding
the current word. To address these issues, we propose Relation-aware Word
Encoding Network (RWEN), which effectively incorporates syntactic and semantic
information based on two modules (i.e., Semantic-level Relation Encoding and
Adjacent Word Relation Encoding). Experimental results show substantial
improvements compared to previous works.
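The abstract names the two modules but does not describe them; purely as a rough illustration, the minimal PyTorch sketch below shows one plausible shape for an Adjacent Word Relation Encoding step, where each word embedding is fused with its immediate left and right neighbors through a learned gate. The class name, gating scheme, and dimensions are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an "Adjacent Word Relation Encoding" step:
# each word is re-encoded from itself and its left/right neighbors.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentWordRelationEncoding(nn.Module):
    """Fuse each word embedding with its adjacent words (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)  # [left; center; right] -> dim
        self.gate = nn.Linear(2 * dim, dim)  # gate between center and context

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) word-level embeddings
        left = F.pad(x, (0, 0, 1, 0))[:, :-1]   # left neighbor (zeros at start)
        right = F.pad(x, (0, 0, 0, 1))[:, 1:]   # right neighbor (zeros at end)
        context = torch.tanh(self.proj(torch.cat([left, x, right], dim=-1)))
        g = torch.sigmoid(self.gate(torch.cat([x, context], dim=-1)))
        return g * x + (1 - g) * context        # gated residual fusion

words = torch.randn(2, 7, 256)  # toy batch: 2 sentences, 7 words each
print(AdjacentWordRelationEncoding(256)(words).shape)  # torch.Size([2, 7, 256])
```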
Related papers
- Improving Neural Biasing for Contextual Speech Recognition by Early Context Injection and Text Perturbation [27.057810339120664]
We propose two techniques to improve context-aware ASR models.
On LibriSpeech, our techniques together reduce the rare word error rate by 60% relative to no biasing and by 25% relative to shallow fusion.
On SPGISpeech and a real-world dataset ConEC, our techniques also yield good improvements over the baselines.
arXiv Detail & Related papers (2024-07-14T19:32:33Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Language-Oriented Communication with Semantic Coding and Knowledge Distillation for Text-to-Image Generation [53.97155730116369]
We put forward a novel framework of language-oriented semantic communication (LSC).
In LSC, machines communicate using human language messages that can be interpreted and manipulated via natural language processing (NLP) techniques for SC efficiency.
We introduce three innovative algorithms: 1) semantic source coding (SSC), which compresses a text prompt into its key head words capturing the prompt's syntactic essence; 2) semantic channel coding (SCC), which improves robustness against errors by substituting head words with their lengthier synonyms; and 3) semantic knowledge distillation (SKD), which produces listener-customized prompts via in-context learning the listener's ...
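The summary only names the idea of keeping a prompt's key head words; the sketch below shows one generic way to select head words with an off-the-shelf dependency parser (spaCy; the model is assumed to be installed). It illustrates head-word selection in general, not the paper's SSC algorithm.

```python
# Illustrative head-word compression: keep noun-chunk roots and the
# sentence root, drop everything else. Not the paper's SSC algorithm.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed

def compress_prompt(prompt: str) -> str:
    doc = nlp(prompt)
    keep = {chunk.root.i for chunk in doc.noun_chunks}    # heads of noun phrases
    keep |= {tok.i for tok in doc if tok.dep_ == "ROOT"}  # sentence root
    return " ".join(tok.text for tok in doc if tok.i in keep)

print(compress_prompt("a photo of a small red fox jumping over a frozen lake"))
# e.g. "photo fox lake" (exact output depends on the parser's analysis)
```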
arXiv Detail & Related papers (2023-09-20T08:19:05Z)
- Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing interactions at the syntax-semantics interface.
This suggests LMs may serve as more useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z)
- Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks [8.683116789109462]
We propose that the most basic syntactic operations can be modeled directly from raw speech in a fully unsupervised way.
We introduce spontaneous concatenation: a phenomenon in which convolutional neural networks (CNNs) trained on acoustic recordings of individual words begin generating outputs containing two or even three words.
arXiv Detail & Related papers (2023-05-02T17:38:21Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
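As a toy illustration of the general idea of semantics-guided polyphone disambiguation (not Dict-TTS itself, whose dictionary, encoder, and matching are far richer), one can score each candidate dictionary gloss against the sentence context and keep the best-matching pronunciation; every entry below is fabricated.

```python
# Toy polyphone disambiguation: choose the pronunciation whose dictionary
# gloss shares the most words with the sentence. All entries fabricated.
SENSES = {  # hypothetical dictionary entry for the written word "lead"
    "/li:d/": "to guide or direct a group of people",
    "/led/": "a soft heavy toxic metal element",
}

def choose_pronunciation(sentence: str, senses: dict[str, str]) -> str:
    ctx = set(sentence.lower().split())
    return max(senses, key=lambda p: len(ctx & set(senses[p].lower().split())))

print(choose_pronunciation("the pipe was cast from a heavy metal", SENSES))
# -> "/led/": the gloss sharing "heavy" and "metal" wins
```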
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- More Romanian word embeddings from the RETEROM project [0.0]
"word embeddings" are automatically learned vector representations of words.
We plan to develop an openaccess large library of ready-to-use word embeddings sets.
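For readers unfamiliar with the term, this is how such a ready-to-use embedding set is typically consumed; gensim is shown as one common option, and the file name and query words are placeholders rather than RETEROM artifacts.

```python
# Minimal example of consuming a pre-trained word-embedding set with gensim.
# The file name and query words are placeholders, not RETEROM artifacts.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("ro_vectors.txt", binary=False)
print(kv.most_similar("casa", topn=3))    # nearest neighbors by cosine
print(kv.similarity("casa", "locuinta"))  # cosine similarity of two words
```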
arXiv Detail & Related papers (2021-11-21T06:05:12Z)
- Dependency Parsing based Semantic Representation Learning with Graph Neural Network for Enhancing Expressiveness of Text-to-Speech [49.05471750563229]
We propose a semantic representation learning method based on a graph neural network that considers the dependency relations of a sentence.
We show that our proposed method outperforms the baseline using vanilla BERT features on both the LJSpeech and Blizzard Challenge 2013 datasets.
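The core idea, passing information along dependency edges so each word's representation reflects its syntactic neighbors, can be sketched in a few lines of PyTorch; the mean aggregation and single round of message passing below are illustrative simplifications, not the paper's architecture.

```python
# One round of message passing over a sentence's dependency edges:
# each word averages its vector with its syntactic neighbors'. Illustrative only.
import torch

def dependency_gnn_step(x: torch.Tensor, edges: list[tuple[int, int]]) -> torch.Tensor:
    # x: (num_words, dim); edges: (head, dependent) pairs, treated as undirected
    n = x.size(0)
    adj = torch.eye(n)  # self-loops keep each word's own information
    for h, d in edges:
        adj[h, d] = adj[d, h] = 1.0
    adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalize -> mean aggregation
    return adj @ x

# "she reads old books": reads->she, reads->books, books->old (toy parse)
x = torch.randn(4, 16)
print(dependency_gnn_step(x, [(1, 0), (1, 3), (3, 2)]).shape)  # torch.Size([4, 16])
```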
arXiv Detail & Related papers (2021-04-14T13:09:51Z)
- GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis [79.1885389845874]
Transformer-based end-to-end text-to-speech synthesis (TTS) is one such successful implementation.
We propose a novel neural TTS model, denoted as GraphSpeech, which is formulated under the graph neural network framework.
Experiments show that GraphSpeech consistently outperforms the Transformer TTS baseline in terms of spectrum and prosody rendering of utterances.
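Setting GraphSpeech's specifics aside, the general mechanism of syntax-aware attention can be illustrated by penalizing attention logits with pairwise syntactic distances; the distance matrix and the penalty weight below are assumptions, not the paper's formulation.

```python
# Illustrative syntax-aware self-attention: attention logits are penalized
# by pairwise dependency-tree distance. Not GraphSpeech's exact mechanism.
import torch
import torch.nn.functional as F

def syntax_biased_attention(q, k, v, tree_dist, alpha: float = 0.5):
    # q, k, v: (seq, dim); tree_dist: (seq, seq) hop counts in the parse tree
    logits = (q @ k.T) / q.size(-1) ** 0.5 - alpha * tree_dist
    return F.softmax(logits, dim=-1) @ v

# toy 4-word sentence whose parse tree is a star centered on word 1
x = torch.randn(4, 32)
dist = torch.tensor([[0., 1, 2, 2], [1, 0, 1, 1], [2, 1, 0, 2], [2, 1, 2, 0]])
print(syntax_biased_attention(x, x, x, dist).shape)  # torch.Size([4, 32])
```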
arXiv Detail & Related papers (2020-10-23T14:14:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.