Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with
Articulatory Features
- URL: http://arxiv.org/abs/2203.03191v1
- Date: Mon, 7 Mar 2022 07:58:01 GMT
- Title: Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with
Articulatory Features
- Authors: Florian Lux, Ngoc Thang Vu
- Abstract summary: In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages.
This enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.
- Score: 30.37026279162593
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While neural text-to-speech systems perform remarkably well in high-resource
scenarios, they cannot be applied to the majority of the over 6,000 spoken
languages in the world due to a lack of appropriate training data. In this
work, we use embeddings derived from articulatory vectors rather than
embeddings derived from phoneme identities to learn phoneme representations
that hold across languages. In conjunction with language agnostic meta
learning, this enables us to fine-tune a high-quality text-to-speech model on
just 30 minutes of data in a previously unseen language spoken by a previously
unseen speaker.
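A minimal sketch of the articulatory-embedding idea, assuming PyTorch. The feature table below is a hypothetical toy inventory (the paper derives its vectors from a full articulatory feature system), and `ArticulatoryEmbedding` is an illustrative name rather than the authors' code: instead of a per-phoneme embedding lookup tied to one language's inventory, each phoneme is described by a language-independent articulatory vector, and a single shared linear layer projects those vectors into the model's embedding space.
```python
import torch
import torch.nn as nn

# Toy articulatory feature table (hypothetical values for illustration):
# each phoneme is a vector of attributes such as
# [voiced, nasal, bilabial, alveolar, open, front, rounded].
ARTICULATORY_FEATURES = {
    "p": [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "b": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "m": [1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    "t": [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "a": [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    "i": [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
}

class ArticulatoryEmbedding(nn.Module):
    """Projects fixed articulatory vectors into the TTS embedding space.

    Unlike nn.Embedding over phoneme IDs, the learned weights are shared
    by all phonemes, so what is learned transfers across languages.
    """

    def __init__(self, feature_dim: int = 7, embed_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(feature_dim, embed_dim)

    def forward(self, phonemes: list[str]) -> torch.Tensor:
        feats = torch.tensor([ARTICULATORY_FEATURES[p] for p in phonemes])
        return self.proj(feats)  # (len(phonemes), embed_dim)

embedder = ArticulatoryEmbedding()
print(embedder(["b", "a", "m", "i"]).shape)  # torch.Size([4, 64])
```
Because any language's phonemes decompose into the same articulatory attributes, a projection trained on high-resource languages remains meaningful for an unseen language, which is what makes fine-tuning on 30 minutes of data plausible.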
Related papers
- Meta Learning Text-to-Speech Synthesis in over 7000 Languages [29.17020696379219]
In this work, we take on the challenging task of building a single text-to-speech synthesis system capable of generating speech in over 7000 languages.
By leveraging a novel integration of massively multilingual pretraining and meta learning, our approach enables zero-shot speech synthesis in languages without any available data.
We aim to empower communities with limited linguistic resources and foster further innovation in the field of speech technology.
arXiv Detail & Related papers (2024-06-10T15:56:52Z)
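Both this entry and the main paper above rely on meta learning across many languages. As a rough illustration only, here is a first-order scheme in the spirit of Reptile, with each language treated as a task; the published methods (including the language-agnostic meta learning of the main paper) differ in their exact objectives, and all names below are illustrative.
```python
import copy
import torch
import torch.nn as nn

def reptile_meta_step(model: nn.Module,
                      language_batches: dict[str, list],
                      loss_fn,
                      inner_lr: float = 1e-3,
                      meta_lr: float = 0.1,
                      inner_steps: int = 3) -> None:
    """One meta-update: adapt a clone to each language, then move the
    shared weights toward the average of the adapted weights."""
    meta_weights = {n: p.clone() for n, p in model.named_parameters()}
    deltas = {n: torch.zeros_like(p) for n, p in meta_weights.items()}

    for lang, batches in language_batches.items():
        clone = copy.deepcopy(model)           # per-language fast weights
        opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
        for step in range(inner_steps):        # inner-loop adaptation
            x, y = batches[step % len(batches)]
            opt.zero_grad()
            loss_fn(clone(x), y).backward()
            opt.step()
        for n, p in clone.named_parameters():  # accumulate weight movement
            deltas[n] += p.detach() - meta_weights[n]

    with torch.no_grad():                      # outer (meta) update
        for n, p in model.named_parameters():
            p += meta_lr * deltas[n] / len(language_batches)
```
The resulting initialization is not specialized to any single language; it is the point in weight space from which a few steps of adaptation reach each language, which is why it fine-tunes quickly on a new one.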
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge [57.38948190611797]
Since low-resource languages do not have enough paired video-text data to train a model, developing lip reading systems for them is considered challenging. This paper proposes a novel lip reading framework designed especially for low-resource languages.
arXiv Detail & Related papers (2023-08-18T05:19:03Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredients are a new dataset based on readings of publicly available religious texts and effective use of self-supervised learning.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
- Low-Resource Multilingual and Zero-Shot Multispeaker TTS [25.707717591185386]
We show that a system can learn to speak a new language using just 5 minutes of training data.
We demonstrate the success of our proposed approach in terms of intelligibility, naturalness and similarity to the target speaker.
arXiv Detail & Related papers (2022-10-21T20:03:37Z)
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models to word-level zero-resource speech recognition.
This is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on wav2vec 2.0 and HuBERT models show that this method achieves word error rates below 20% for some languages.
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
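A rough sketch of that recipe, assuming PyTorch and a CTC objective (the standard way such encoders are fine-tuned for transcription). The encoder here is a placeholder module and the IPA inventory a toy subset; a real run would load an actual wav2vec 2.0 or HuBERT checkpoint, and language-model decoding is omitted.
```python
import torch
import torch.nn as nn

IPA_INVENTORY = ["<blank>", "a", "i", "u", "p", "t", "k", "m", "n", "s"]  # toy subset

class PhonemeCTCHead(nn.Module):
    """CTC head over an IPA inventory on top of a (frozen) speech encoder."""

    def __init__(self, encoder: nn.Module, encoder_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # freeze pre-trained weights
            p.requires_grad = False
        self.head = nn.Linear(encoder_dim, len(IPA_INVENTORY))

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(audio_feats)       # (batch, frames, dim)
        return self.head(hidden).log_softmax(-1)

# Stand-in encoder for illustration only.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
model = PhonemeCTCHead(encoder, encoder_dim=256)

ctc = nn.CTCLoss(blank=0)
audio = torch.randn(2, 120, 80)                  # 2 utterances, 120 frames
targets = torch.randint(1, len(IPA_INVENTORY), (2, 30))  # IPA phoneme IDs
log_probs = model(audio).transpose(0, 1)         # CTC expects (T, N, C)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 30))
loss.backward()
```
Training only the small phoneme head over a universal IPA inventory is what lets the approach work in languages for which the encoder never saw labeled data.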
- Few-Shot Cross-Lingual TTS Using Transferable Phoneme Embedding [55.989376102986654]
This paper studies a transferable phoneme embedding framework that aims to deal with the cross-lingual text-to-speech problem under the few-shot setting.
We propose a framework that consists of a phoneme-based TTS model and a codebook module to project phonemes from different languages into a learned latent space.
arXiv Detail & Related papers (2022-06-27T11:24:40Z)
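One way to picture the codebook module, as a schematic reading of the idea rather than the paper's actual architecture: each language keeps its own phoneme embeddings, but those embeddings can only express themselves through soft attention over a shared, language-agnostic codebook, so every language's phonemes land in the same latent space.
```python
import torch
import torch.nn as nn

class PhonemeCodebook(nn.Module):
    """Maps language-specific phoneme embeddings into a shared latent space
    by soft attention over a learned, language-agnostic codebook."""

    def __init__(self, num_codes: int = 64, dim: int = 128):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, phoneme_emb: torch.Tensor) -> torch.Tensor:
        # (batch, seq, dim) @ (dim, num_codes) -> attention weights over codes
        attn = torch.softmax(phoneme_emb @ self.codebook.T, dim=-1)
        return attn @ self.codebook   # latent vectors shared across languages

# Each language has its own (possibly tiny) embedding table ...
emb_en = nn.Embedding(44, 128)   # e.g. English inventory
emb_new = nn.Embedding(31, 128)  # e.g. a new low-resource language
codebook = PhonemeCodebook()

# ... but both are expressed through the same codebook entries.
latent_en = codebook(emb_en(torch.randint(0, 44, (1, 10))))
latent_new = codebook(emb_new(torch.randint(0, 31, (1, 10))))
print(latent_en.shape, latent_new.shape)  # both torch.Size([1, 10, 128])
```
In the few-shot setting, only the new language's small embedding table needs to be learned; the codebook and the TTS model on top stay fixed.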
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge the digital divide between high-resource and low-resource languages.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
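A toy sketch of how such transfer-pair selection could look, assuming each language is summarized by a single pooled acoustic embedding (the paper's concrete similarity measure may differ): candidate donor languages are ranked for each low-resource target by embedding similarity.
```python
import torch

def rank_transfer_pairs(lang_embs: dict[str, torch.Tensor],
                        targets: list[str],
                        donors: list[str],
                        top_k: int = 3) -> dict[str, list[str]]:
    """For each low-resource target language, rank candidate donor
    languages by cosine similarity of their acoustic embeddings."""
    ranked = {}
    for t in targets:
        sims = {d: torch.cosine_similarity(lang_embs[t], lang_embs[d], dim=0).item()
                for d in donors}
        ranked[t] = sorted(sims, key=sims.get, reverse=True)[:top_k]
    return ranked

# Hypothetical language-level embeddings; in practice these would be
# pooled features from a speech encoder run over each language's audio.
torch.manual_seed(0)
embs = {lang: torch.randn(256) for lang in ["sw", "yo", "en", "fr", "hi", "ar"]}
print(rank_transfer_pairs(embs, targets=["sw", "yo"], donors=["en", "fr", "hi", "ar"]))
```
Ranking by a cheap embedding similarity avoids the combinatorial cost of actually training a transfer model for every source-target pair across hundreds of languages.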
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
While transcribing endangered speech is costly, it is relatively easy to obtain textual translations corresponding to speech.
We therefore construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
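Pairing an audio encoder with textual translations suggests a contrastive objective. A compact sketch under that assumption, with a random stand-in for the text side and an InfoNCE-style loss (CSTNet's actual architecture and combination of losses differ in detail):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAudioEncoder(nn.Module):
    """1-D convolutional encoder that pools speech into one utterance vector."""

    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2), nn.ReLU(),
        )

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        return self.conv(mels).mean(dim=-1)  # (batch, dim)

def contrastive_loss(audio_vec, text_vec, temperature=0.1):
    """InfoNCE: matched (speech, translation) pairs must score higher
    than every mismatched pair in the batch."""
    a = F.normalize(audio_vec, dim=-1)
    t = F.normalize(text_vec, dim=-1)
    logits = a @ t.T / temperature           # (batch, batch) similarities
    labels = torch.arange(len(a))            # diagonal entries are true pairs
    return F.cross_entropy(logits, labels)

encoder = ConvAudioEncoder()
mels = torch.randn(8, 80, 200)       # 8 utterances of log-mel features
text_vecs = torch.randn(8, 256)      # stand-in embeddings of their translations
loss = contrastive_loss(encoder(mels), text_vecs)
loss.backward()
```
The supervision signal is only "which translation belongs to which utterance", so no manual phonetic annotation of the endangered language is needed.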
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.