A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
- URL: http://arxiv.org/abs/2409.07891v3
- Date: Sat, 19 Oct 2024 15:01:35 GMT
- Title: A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
- Authors: Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen
- Abstract summary: We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus.
We show that the tonal context substantially modifies a word's canonical tone.
We also show that word, and even more so, word sense, co-determine words' F0 contours.
- Score: 3.072340427031969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In Mandarin, the tonal contours of monosyllabic words produced in isolation or in careful speech are characterized by four lexical tones: a high-level tone (T1), a rising tone (T2), a dipping tone (T3), and a falling tone (T4). However, in spontaneous speech, the actual tonal realization of monosyllabic words can deviate significantly from these canonical tones due to intra-syllabic co-articulation and inter-syllabic co-articulation with adjacent tones. In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on their research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and on the way in which words' meanings co-determine pitch contours on the other hand. We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus, using the generalized additive (mixed) model to decompose a given observed pitch contour into a set of component pitch contours. We show that the tonal context substantially modifies a word's canonical tone. Once the effect of tonal context is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is realized based on the preceding tone, emerges as a low tone in its own right, modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. We also show that word, and even more so word sense, co-determine words' F0 contours. Analyses of variable importance using random forests further supported the substantial effect of tonal context and an effect of word sense.
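The decomposition described above — treating an observed F0 contour as a sum of component contours contributed by tone, tonal context, and other predictors — can be illustrated with a minimal additive-model sketch. This is not the authors' GAMM code (they presumably used a dedicated package such as R's mgcv); here a toy polynomial basis over normalized time is fit by least squares on simulated data, with all names and contour shapes being illustrative assumptions.

```python
# Minimal sketch of additive pitch-contour decomposition (illustrative only,
# not the paper's actual GAMM analysis). Each predictor level contributes a
# smooth component contour over normalized time:  f0(t) = f_tone(t) + f_ctx(t) + noise
import numpy as np

def poly_basis(t, degree=3):
    """Polynomial basis over normalized time t in [0, 1] (stand-in for splines)."""
    return np.vstack([t**d for d in range(degree + 1)]).T  # shape (n, degree+1)

rng = np.random.default_rng(0)

# Simulated tokens: 10 normalized time points each, a canonical tone,
# and a binary preceding-tone context.
n_tokens, n_points = 200, 10
t = np.tile(np.linspace(0, 1, n_points), n_tokens)
tone = rng.integers(0, 2, n_tokens).repeat(n_points)  # 0 = "T1" (high), 1 = "T4" (falling)
prev = rng.integers(0, 2, n_tokens).repeat(n_points)  # preceding-tone context

# Toy ground-truth components: a high-level contour, a high-to-mid falling
# contour, and a context effect raising the onset after a high preceding tone.
true_tone = np.where(tone == 0, 2.0, 2.0 - 1.5 * t)
true_ctx = np.where(prev == 0, 0.0, 0.5 * (1 - t))
f0 = true_tone + true_ctx + rng.normal(0, 0.1, t.size)

# Additive design matrix: one smooth per tone, plus a context smooth
# (context level 0 serves as the baseline).
B = poly_basis(t)
X = np.hstack([B * (tone == k)[:, None] for k in (0, 1)]
              + [B * (prev == 1)[:, None]])
coef, *_ = np.linalg.lstsq(X, f0, rcond=None)

# Recover the fitted component contour for the falling tone on a time grid.
grid = np.linspace(0, 1, n_points)
t4_curve = poly_basis(grid) @ coef[4:8]
print("fitted falling-tone contour, onset vs. offset:", t4_curve[0], t4_curve[-1])
```

Once the context component is estimated separately, the tone components emerge "controlled for" tonal context, which is the sense in which the abstract's T2/T3/T4 shapes are read off the fitted smooths rather than off raw contours.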
Related papers
- The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling [1.7723990552388866]
This study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones.
The results show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.
arXiv Detail & Related papers (2025-03-29T17:39:55Z) - Form and meaning co-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of Tone 3 sandhi [1.7723990552388866]
In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3.
Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2.
The present study investigates the pitch contours of two-character words with T2-T3 and T3-T3 tone patterns in spontaneous Taiwan Mandarin conversations.
arXiv Detail & Related papers (2024-08-28T12:25:45Z) - Word-specific tonal realizations in Mandarin [0.9249657468385781]
This study shows that tonal realization is also partially determined by words' meanings.
We first show, on the basis of a Taiwan corpus of spontaneous conversations, that word type is a stronger predictor of pitch realization than all the previously established word-form related predictors combined.
We then proceed to show, using computational modeling with context-specific word embeddings, that token-specific pitch contours predict word type with 50% accuracy on held-out data.
arXiv Detail & Related papers (2024-05-11T13:00:35Z) - Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control the intensity of accent in the process of TTS is a very interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle the speaker and accent information, and then adjusts the loss weight to control the accent intensity.
This paper proposes a new, intuitive, and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z) - Cross-strait Variations on Two Near-synonymous Loanwords xie2shang1 and tan2pan4: A Corpus-based Comparative Study [2.6194322370744305]
This study attempts to investigate cross-strait variations on two typical synonymous loanwords in Chinese, i.e. xie2shang1 and tan2pan4.
Through a comparative analysis, the study found some distributional, eventual, and contextual similarities and differences across Taiwan and Mainland Mandarin.
arXiv Detail & Related papers (2022-10-09T04:10:58Z) - Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
arXiv Detail & Related papers (2022-09-22T06:13:07Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving up to a 21.4x speedup over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z) - Phrase break prediction with bidirectional encoder representations in Japanese text-to-speech synthesis [8.391631335854457]
We propose a novel phrase break prediction method that combines implicit features extracted from a pre-trained large language model (BERT) and explicit features extracted from a BiLSTM with linguistic features.
The proposed method takes account of both representations to extract the latent semantics, which cannot be captured by previous methods.
arXiv Detail & Related papers (2021-04-26T08:29:29Z) - Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features [1.6286844497313562]
We propose a strategy for conditioning Tacotron-2 on two fundamental prosodic features in English -- stress syllable and pitch accent.
We show that jointly conditioned features at pre-encoder and intra-decoder stages result in prosodically natural synthesized speech.
arXiv Detail & Related papers (2021-04-08T20:50:15Z) - Decomposing lexical and compositional syntax and semantics with deep language models [82.81964713263483]
The activations of language transformers like GPT2 have been shown to linearly map onto brain activity during speech comprehension.
Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four classes: lexical, compositional, syntactic, and semantic representations.
The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices.
arXiv Detail & Related papers (2021-03-02T10:24:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.