Grapheme-to-Phoneme Transformer Model for Transfer Learning Dialects
- URL: http://arxiv.org/abs/2104.04091v1
- Date: Thu, 8 Apr 2021 21:36:21 GMT
- Authors: Eric Engelhart, Mahsa Elyasi, Gaurav Bharaj
- Abstract summary: Grapheme-to-Phoneme (G2P) models convert words to their phonetic pronunciations.
Usually, dictionary-based methods require significant manual effort to build and have limited adaptivity to unseen words.
We propose a novel use of a transformer-based attention model that can adapt to unseen dialects of English while using a small dictionary.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grapheme-to-Phoneme (G2P) models convert words to their phonetic
pronunciations. Classic G2P methods include rule-based systems and
pronunciation dictionaries, while modern G2P systems incorporate learning, such
as LSTM and Transformer-based attention models. Dictionary-based methods
usually require significant manual effort to build and have limited adaptivity
to unseen words, while transformer-based models require significant training
data and do not generalize well, especially for dialects with limited data.
We propose a novel use of a transformer-based attention model that can adapt to
unseen dialects of English while using a small dictionary. We show that our
method has potential applications in accent transfer for text-to-speech and in
building robust G2P models for dialects with limited pronunciation-dictionary
size.
We experiment with two English dialects: Indian and British. A model trained
from scratch on 1,000 words from the British English dictionary, with 14,211
words held out, yields a phoneme error rate (PER) of 26.877% on a test set
generated from the full dictionary. The same model, pretrained on the CMUDict
American English dictionary and fine-tuned on the same dataset, achieves a PER
of 2.469% on the test set.
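The PER figures above follow the standard definition: total phoneme-level edit distance between hypothesis and reference, divided by the total number of reference phonemes. A minimal sketch of that computation (the helper names and ARPAbet-style symbols are illustrative, not taken from the paper):

```python
def levenshtein(a, b):
    """Edit distance between two phoneme sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (pa != pb)))   # substitution
        prev = cur
    return prev[-1]

def phoneme_error_rate(refs, hyps):
    """PER in percent: total edits / total reference phonemes."""
    edits = sum(levenshtein(h, r) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return 100.0 * edits / total

# One substitution (AE -> AH) in a 3-phoneme reference: PER ~= 33.3%
per = phoneme_error_rate([["K", "AE", "T"]], [["K", "AH", "T"]])
```

Note that PER is computed per phoneme, not per word, which is why a model can score 2.469% PER while still mispronouncing a larger fraction of whole words.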
Related papers
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are an essential component of a Text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation-related task, typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z)
- Improving grapheme-to-phoneme conversion by learning pronunciations from speech recordings [12.669655363646257]
The Grapheme-to-Phoneme (G2P) task aims to convert orthographic input into a discrete phonetic representation.
We propose a method to improve the G2P conversion task by learning pronunciation examples from audio recordings.
arXiv Detail & Related papers (2023-07-31T13:25:38Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation [10.016862617549991]
This paper proposes SoundChoice, a novel Grapheme-to-Phoneme (G2P) architecture that processes entire sentences rather than operating at the word level.
SoundChoice achieves a Phoneme Error Rate (PER) of 2.65% on whole-sentence transcription using data from LibriSpeech and Wikipedia.
arXiv Detail & Related papers (2022-07-27T01:14:59Z)
- Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online website dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed thereafter) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion [13.543705472805431]
We present a single end-to-end trained neural G2P model that shares the same encoder and decoder across multiple languages.
We show a 7.2% average improvement in phoneme error rate on low-resource languages and no degradation on high-resource ones, compared to monolingual baselines.
arXiv Detail & Related papers (2020-06-25T06:16:29Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.