Revisiting Neural Language Modelling with Syllables
- URL: http://arxiv.org/abs/2010.12881v1
- Date: Sat, 24 Oct 2020 11:44:41 GMT
- Title: Revisiting Neural Language Modelling with Syllables
- Authors: Arturo Oncevay and Kervy Rivas Rojas
With a comparable perplexity, we show that syllables outperform characters, annotated morphemes and unsupervised subwords.
- Score: 3.198144010381572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language modelling is regularly analysed at word, subword or character units,
but syllables are seldom used. Syllables provide shorter sequences than
characters, they can be extracted with rules, and their segmentation typically
requires less specialised effort than identifying morphemes. We reconsider
syllables for an open-vocabulary generation task in 20 languages. We use
rule-based syllabification methods for five languages and address the rest with
a hyphenation tool, whose behaviour as a syllable proxy we validate. At
comparable perplexity, we show that syllables outperform characters, annotated
morphemes and unsupervised subwords. Finally, we also study the overlap of
syllables with other subword pieces and discuss some limitations and
opportunities.
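
As an illustration of rule-based syllabification, the kind of segmentation the abstract refers to can be sketched with a toy onset heuristic. This is a hypothetical simplification, not the paper's actual per-language rules: `VOWELS`, `syllabify`, and the single-consonant-onset policy are all illustrative assumptions.

```python
# Toy rule-based syllabifier (illustrative heuristic only): each vowel
# group forms a nucleus; a single consonant between nuclei becomes the
# onset of the next syllable, while longer clusters leave their first
# consonant in the coda of the preceding syllable.
VOWELS = set("aeiou")

def syllabify(word: str) -> list[str]:
    word = word.lower()
    # Locate vowel nuclei (consecutive vowels count as one nucleus).
    nuclei = []
    i = 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j + 1 < len(word) and word[j + 1] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j + 1
        else:
            i += 1
    if not nuclei:
        return [word]
    # Place a syllable boundary inside each inter-nucleus consonant cluster.
    bounds = [0]
    for (_, a_end), (b_start, _) in zip(nuclei, nuclei[1:]):
        cluster_len = b_start - a_end - 1
        if cluster_len <= 1:
            bounds.append(b_start - cluster_len)  # consonant (if any) -> onset
        else:
            bounds.append(a_end + 2)  # first consonant closes the coda
    bounds.append(len(word))
    return [word[s:e] for s, e in zip(bounds, bounds[1:])]

print(syllabify("banana"))  # ['ba', 'na', 'na']
print(syllabify("window"))  # ['win', 'dow']
```

Real rule-based methods, as used in the paper, encode language-specific phonotactics (diphthongs, digraphs, legal onsets); the hyphenation-tool fallback plays a similar role where such rules are unavailable.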
Related papers
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization allows words to be split into characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language.
arXiv Detail & Related papers (2024-03-26T17:26:50Z) - Multilingual context-based pronunciation learning for Text-to-Speech [13.941800219395757]
Phonetic information and linguistic knowledge are essential components of a Text-to-speech (TTS) front-end.
We showcase a multilingual unified front-end system that addresses any pronunciation related task, typically handled by separate modules.
We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when compared to equivalent monolingual solutions.
arXiv Detail & Related papers (2023-07-31T14:29:06Z) - Revisiting Syllables in Language Modelling and their Application on
Low-Resource Machine Translation [1.2617078020344619]
Syllables provide shorter sequences than characters, require less specialised extraction rules than morphemes, and their segmentation is not impacted by the corpus size.
We first explore the potential of syllables for open-vocabulary language modelling in 21 languages.
We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy.
arXiv Detail & Related papers (2022-10-05T18:55:52Z) - Dict-TTS: Learning to Pronounce with Prior Dictionary Knowledge for
Text-to-Speech [88.22544315633687]
Polyphone disambiguation aims to capture accurate pronunciation knowledge from natural text sequences for reliable Text-to-speech systems.
We propose Dict-TTS, a semantic-aware generative text-to-speech model with an online dictionary.
Experimental results in three languages show that our model outperforms several strong baseline models in terms of pronunciation accuracy.
arXiv Detail & Related papers (2022-06-05T10:50:34Z) - Towards Language Modelling in the Speech Domain Using Sub-word
Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - Syllabic Quantity Patterns as Rhythmic Features for Latin Authorship
Attribution [74.27826764855911]
We employ syllabic quantity as a base for deriving rhythmic features for the task of computational authorship attribution of Latin prose texts.
Our experiments, carried out on three different datasets, using two different machine learning methods, show that rhythmic features based on syllabic quantity are beneficial in discriminating among Latin prose authors.
arXiv Detail & Related papers (2021-10-27T06:25:31Z) - Syllabification of the Divine Comedy [0.0]
We provide a syllabification algorithm for the Divine Comedy using techniques from probabilistic and constraint programming.
We particularly focus on the synalephe, addressed in terms of the "propensity" of a word to take part in a synalephe with adjacent words.
We jointly provide an online vocabulary containing, for each word, information about its syllabification, the location of the tonic accent, and the aforementioned synalephe propensity.
arXiv Detail & Related papers (2020-10-26T12:14:14Z) - Investigating Cross-Linguistic Adjective Ordering Tendencies with a
Latent-Variable Model [66.84264870118723]
We present the first purely corpus-driven model of multi-lingual adjective ordering in the form of a latent-variable model.
We provide strong converging evidence for the existence of universal, cross-linguistic, hierarchical adjective ordering tendencies.
arXiv Detail & Related papers (2020-10-09T18:27:55Z) - Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z) - Self-organizing Pattern in Multilayer Network for Words and Syllables [17.69876273827734]
We propose a new universal law that highlights the equally important role of syllables.
By plotting rank-rank frequency distribution of word and syllable for English and Chinese corpora, visible lines appear and can be fit to a master curve.
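The rank-frequency ranking underlying such a plot can be sketched as follows. This is a generic Zipf-style computation with `collections.Counter`; the function name `rank_frequency`, the toy corpus, and the omitted master-curve fit are illustrative assumptions, not the paper's pipeline.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs sorted by descending frequency:
    the raw material for a Zipf-style rank-frequency plot."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return [(rank, freq) for rank, (_, freq) in enumerate(ordered, start=1)]

# Toy corpus at the word level; syllable-level ranks would be computed the
# same way after segmenting each word into syllables.
words = "the cat sat on the mat the cat".split()
print(rank_frequency(words))  # [(1, 3), (2, 2), (3, 1), (4, 1), (5, 1)]
```

A rank-rank plot then pairs, for each word, its word-level rank with the rank of one of its syllables, which is where the paper's master curve emerges.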
arXiv Detail & Related papers (2020-05-05T12:01:47Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.