Applying Phonological Features in Multilingual Text-To-Speech
- URL: http://arxiv.org/abs/2110.03609v2
- Date: Sun, 10 Oct 2021 11:45:04 GMT
- Title: Applying Phonological Features in Multilingual Text-To-Speech
- Authors: Cong Zhang, Huinan Zeng, Huang Liu, Jiewen Zheng
- Abstract summary: We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
We tested whether this mapping could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
- Score: 2.567123525861164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study investigates whether phonological features can be applied in
text-to-speech systems to generate native and non-native speech in English and
Mandarin. We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to
phonological features. We tested whether this mapping could lead to the
successful generation of native, non-native, and code-switched speech in the
two languages. We ran two experiments, one with a small dataset and one with a
larger dataset. The results proved that phonological features could be used as
a feasible input system, although further investigation is needed to improve
model performance. The accented output generated by the TTS models also helps
with understanding human second language acquisition processes.
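The two-step mapping described in the abstract (language-specific symbols to a shared SAMPA-style inventory, then to phonological features) can be sketched roughly as below. This is a minimal illustration, not the authors' mapping: the table contents, the feature names in `FEATURES`, and the `encode` helper are all assumptions chosen to show the idea on a handful of sounds.

```python
# Illustrative sketch only: every table below is a hypothetical toy subset,
# not the mapping published in the paper.

# Step 1a: ARPABET (English) symbol -> SAMPA-style symbol.
ARPABET_TO_SAMPA = {"P": "p", "B": "b", "M": "m", "IY": "i", "AA": "A"}

# Step 1b: pinyin (Mandarin) symbol -> SAMPA-style symbol.
# Pinyin "b" is an unaspirated voiceless stop and "p" its aspirated counterpart;
# "p_h" is X-SAMPA-style notation used here as an assumption.
PINYIN_TO_SAMPA = {"b": "p", "p": "p_h", "m": "m", "i": "i", "a": "a"}

# Step 2: SAMPA-style symbol -> set of active phonological features.
# The feature inventory is a simplified assumption.
FEATURES = ["consonantal", "labial", "voiced", "nasal",
            "spread_glottis", "high", "low", "back"]

SAMPA_TO_FEATURES = {
    "p": {"consonantal", "labial"},
    "p_h": {"consonantal", "labial", "spread_glottis"},
    "b": {"consonantal", "labial", "voiced"},
    "m": {"consonantal", "labial", "voiced", "nasal"},
    "i": {"voiced", "high"},
    "a": {"voiced", "low"},
    "A": {"voiced", "low", "back"},
}


def encode(symbols, to_sampa):
    """Turn language-specific symbols into binary feature vectors."""
    vectors = []
    for symbol in symbols:
        active = SAMPA_TO_FEATURES[to_sampa[symbol]]
        vectors.append([1 if name in active else 0 for name in FEATURES])
    return vectors


# Both languages end up in the same feature space, so a single model input
# layer could in principle serve native, non-native, and code-switched text.
english = encode(["P", "IY"], ARPABET_TO_SAMPA)
mandarin = encode(["b", "i"], PINYIN_TO_SAMPA)
print(english == mandarin)
```

In this toy table the English and Mandarin sequences encode to identical vectors, which is the property that makes a shared feature input attractive for cross-lingual synthesis; a real inventory would need many more features and symbols.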
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Applying Feature Underspecified Lexicon Phonological Features in Multilingual Text-to-Speech [1.9688095374610102]
We present a mapping of ARPABET/pinyin to SAMPA/SAMPA-SC and then to phonological features.
This mapping was tested for whether it could lead to the successful generation of native, non-native, and code-switched speech in the two languages.
arXiv Detail & Related papers (2022-04-14T21:04:55Z)
- WLASL-LEX: a Dataset for Recognising Phonological Properties in American Sign Language [2.814213966364155]
We build a large-scale dataset of American Sign Language signs annotated with six different phonological properties.
We investigate whether data-driven end-to-end and feature-based approaches can be optimised to automatically recognise these properties.
arXiv Detail & Related papers (2022-03-11T17:21:24Z)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
- Acoustics Based Intent Recognition Using Discovered Phonetic Units for Low Resource Languages [51.0542215642794]
We propose a novel acoustics based intent recognition system that uses discovered phonetic units for intent classification.
We present results for two language families, Indic languages and Romance languages, on two different intent recognition tasks.
arXiv Detail & Related papers (2020-11-07T00:35:31Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)
- Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z)
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)