Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis
- URL: http://arxiv.org/abs/2203.15276v1
- Date: Tue, 29 Mar 2022 06:45:28 GMT
- Title: Applying Syntax–Prosody Mapping Hypothesis and Prosodic Well-Formedness Constraints to Neural Sequence-to-Sequence Speech Synthesis
- Authors: Kei Furukawa, Takeshi Kishiyama, and Satoshi Nakamura
- Abstract summary: End-to-end text-to-speech (TTS) generates speech sounds directly from strings of text or phonemes.
This study investigates whether such models can reproduce the pitch patterns of phonological phenomena, such as downstep, rhythmic boost, and initial lowering, that are explained by phonological constraints.
The proposed model efficiently synthesizes phonological phenomena in the test data that were not explicitly included in the training data.
- Score: 7.609330016848916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end text-to-speech synthesis (TTS), which generates speech sounds
directly from strings of text or phonemes, has improved the quality of speech
synthesis over conventional TTS. However, most previous studies have been
evaluated based on subjective naturalness and have not objectively examined
whether they can reproduce pitch patterns of phonological phenomena such as
downstep, rhythmic boost, and initial lowering that reflect syntactic
structures in Japanese. These phenomena can be linguistically explained by
phonological constraints and the syntax–prosody mapping
hypothesis (SPMH), which assumes projections from syntactic structures to
phonological hierarchy. Although some experiments in psycholinguistics have
verified the validity of the SPMH, it is crucial to investigate whether it can
be implemented in TTS. To synthesize linguistic phenomena involving syntactic
or phonological constraints, we propose a model using phonological symbols
based on the SPMH and prosodic well-formedness constraints. Experimental
results showed that the proposed method synthesized similar pitch patterns to
those reported in linguistics experiments for the phenomena of initial lowering
and rhythmic boost. The proposed model efficiently synthesizes phonological
phenomena in the test data that were not explicitly included in the training
data.
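As a rough, hypothetical illustration of the approach described above (not the authors' released code), the Python sketch below projects a toy syntactic parse onto prosodic phrases via a Match-style mapping, applies an assumed binarity well-formedness constraint, and serializes the result as a phoneme string interleaved with boundary symbols for a sequence-to-sequence TTS front end. The boundary symbol "#", the helper names, and the example sentence are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: deriving prosodic-phrase boundary symbols from a
# syntactic parse (SPMH-style Match mapping plus a toy well-formedness
# constraint) and interleaving them with the phoneme string fed to a
# sequence-to-sequence TTS encoder. All names and symbols are illustrative.

from dataclasses import dataclass
from typing import List


@dataclass
class SyntacticPhrase:
    """A flat syntactic phrase: a list of words, each given as a phoneme list."""
    words: List[List[str]]


def match_phrase(parse: List[SyntacticPhrase]) -> List[List[List[str]]]:
    """Match-style mapping: each syntactic phrase projects one prosodic phrase."""
    return [p.words for p in parse]


def enforce_binarity(pphrases: List[List[List[str]]],
                     max_words: int = 2) -> List[List[List[str]]]:
    """Toy well-formedness constraint: split any prosodic phrase longer than
    `max_words` words so that no phrase exceeds that size (assumed rule)."""
    out = []
    for phrase in pphrases:
        for i in range(0, len(phrase), max_words):
            out.append(phrase[i:i + max_words])
    return out


def to_tts_input(pphrases: List[List[List[str]]]) -> List[str]:
    """Serialize phonemes with '#' marking each prosodic-phrase boundary."""
    seq: List[str] = []
    for phrase in pphrases:
        for word in phrase:
            seq.extend(word)
        seq.append("#")  # boundary symbol consumed by the TTS encoder
    return seq


if __name__ == "__main__":
    # Romanized toy example: [[aoi ringo o] [tabeta]] as two syntactic phrases.
    parse = [
        SyntacticPhrase(words=[list("aoi"), list("ringo"), list("o")]),
        SyntacticPhrase(words=[list("tabeta")]),
    ]
    symbols = to_tts_input(enforce_binarity(match_phrase(parse)))
    print(" ".join(symbols))
    # -> a o i r i n g o # o # t a b e t a #
```

In practice, boundary symbols of this kind would presumably be added to the input vocabulary of the sequence-to-sequence model so that the encoder can learn the pitch patterns (e.g., initial lowering at phrase onsets) associated with them.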
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Perception of Phonological Assimilation by Neural Speech Recognition Models [3.4173734484549625]
This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds.
Using psycholinguistic stimuli, we analyze how various linguistic context cues influence compensation patterns in the model's output.
arXiv Detail & Related papers (2024-06-21T15:58:22Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- How Generative Spoken Language Modeling Encodes Noisy Speech: Investigation from Phonetics to Syntactics [33.070158866023]
Generative spoken language modeling (GSLM) uses learned symbols derived from data, rather than phonemes, for speech analysis and synthesis.
This paper presents findings on the effectiveness of GSLM's encoding and decoding at the spoken-language and speech levels.
arXiv Detail & Related papers (2023-06-01T14:07:19Z)
- Prosody-controllable spontaneous TTS with neural HMMs [11.472325158964646]
We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets.
We add utterance-level prosody control to an existing neural HMM-based TTS system.
We evaluate the system's capability of synthesizing two types of creaky voice.
arXiv Detail & Related papers (2022-11-24T11:06:11Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis [102.80458458550999]
We investigate the tradeoffs between sparsity and its subsequent effects on synthetic speech.
Our findings suggest that not only are end-to-end TTS models highly prunable, but also, perhaps surprisingly, pruned TTS models can produce synthetic speech with equal or higher naturalness and intelligibility.
arXiv Detail & Related papers (2021-10-04T02:03:28Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit [39.258370942013165]
A prosody learning mechanism is proposed to model the prosody of speech within a TTS system.
A novel self-attention structure, named local attention, is proposed to lift the restriction on input text length.
Experiments on English and Mandarin show that speech with more satisfactory prosody is obtained with our model.
arXiv Detail & Related papers (2020-08-13T02:54:50Z)
- Constructing a Family Tree of Ten Indo-European Languages with Delexicalized Cross-linguistic Transfer Patterns [57.86480614673034]
We formalize the delexicalized transfer as interpretable tree-to-string and tree-to-tree patterns.
This allows us to quantitatively probe cross-linguistic transfer and extend inquiries of Second Language Acquisition.
arXiv Detail & Related papers (2020-07-17T15:56:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.