Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)
- URL: http://arxiv.org/abs/2207.01547v1
- Date: Mon, 4 Jul 2022 16:14:57 GMT
- Title: Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS)
- Authors: Ariadna Sanchez, Alessio Falai, Ziyao Zhang, Orazio Angelini, Kayoko Yanagisawa
- Abstract summary: Unified representations consistently achieve better cross-lingual synthesis with respect to both naturalness and accent.
Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity.
- Score: 3.57486761615991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An essential design decision for multilingual Neural Text-To-Speech (NTTS)
systems is how to represent input linguistic features within the model. Looking
at the wide variety of approaches in the literature, two main paradigms emerge,
unified and separate representations. The former uses a shared set of phonetic
tokens across languages, whereas the latter uses unique phonetic tokens for
each language. In this paper, we conduct a comprehensive study comparing
multilingual NTTS models trained with both representations. Our results
reveal that the unified approach consistently achieves better cross-lingual
synthesis with respect to both naturalness and accent. Separate representations
tend to have an order of magnitude more tokens than unified ones, which may
affect model capacity. For this reason, we carry out an ablation study to
understand the interaction of the representation type with the size of the
token embedding. We find that the difference between the two paradigms only
emerges above a certain threshold embedding size. This study provides strong
evidence that unified representations should be the preferred paradigm when
building multilingual NTTS systems.
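To make the two paradigms concrete, here is a minimal sketch (not from the paper) contrasting a unified, shared phoneme inventory with per-language token sets, and the embedding table each implies. The toy phoneme subsets, the `PHONEMES` mapping, and the `EMBED_DIM` value are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: unified vs. separate phonetic token inventories
# for a two-language NTTS front end. Inventories and sizes are toy examples.

import torch.nn as nn

# Per-language phoneme inventories (toy IPA-like subsets).
PHONEMES = {
    "en": ["p", "t", "k", "s", "ʃ", "æ", "ɪ"],
    "es": ["p", "t", "k", "s", "r", "a", "i"],
}

# Unified representation: one shared token set across languages.
# Identical symbols collapse into a single token, so the vocabulary
# stays small and embeddings are shared cross-lingually.
unified_vocab = sorted({sym for inv in PHONEMES.values() for sym in inv})

# Separate representation: each language keeps its own tokens ("en/p" vs.
# "es/p"), so the vocabulary grows with the number of languages. This is
# why separate systems can end up with an order of magnitude more tokens.
separate_vocab = sorted(
    f"{lang}/{sym}" for lang, inv in PHONEMES.items() for sym in inv
)

EMBED_DIM = 128  # hypothetical embedding size; the paper ablates this knob

unified_emb = nn.Embedding(len(unified_vocab), EMBED_DIM)
separate_emb = nn.Embedding(len(separate_vocab), EMBED_DIM)

print(len(unified_vocab), len(separate_vocab))            # 10 vs. 14 tokens
print(unified_emb.weight.shape, separate_emb.weight.shape)
```

With many languages, the separate vocabulary scales roughly linearly with the language count while the unified one grows only with genuinely new symbols; that vocabulary gap, interacting with the embedding size, is the capacity question the ablation above probes.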
Related papers
- How do Multimodal Foundation Models Encode Text and Speech? An Analysis of Cross-Lingual and Cross-Modal Representations [17.528100902591056]
Cross-modal representations converge over model layers, except in the initial layers, which specialize in text and speech processing.
Speech exhibits larger cross-lingual differences than text.
For models not explicitly trained for modality-agnostic representations, the modality gap is more prominent than the language gap.
arXiv Detail & Related papers (2024-11-26T18:29:11Z) - Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement [1.4335183427838039]
We take the approach of developing curated synthetic data on a large scale, with specific properties.
We use a new multiple-choice task and datasets, Blackbird Language Matrices, to focus on a specific grammatical structural phenomenon.
We show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences.
arXiv Detail & Related papers (2024-09-10T14:58:55Z) - Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z) - MAESTRO: Matched Speech Text Representations through Modality Matching [35.566604806335626]
Maestro is a self-supervised training method to unify representations learnt from speech and text modalities.
We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER).
We establish a new state-of-the-art (SOTA) on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
arXiv Detail & Related papers (2022-04-07T12:48:16Z) - Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z) - Exploring Teacher-Student Learning Approach for Multi-lingual Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are in distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.