Revisiting Syllables in Language Modelling and their Application on
Low-Resource Machine Translation
- URL: http://arxiv.org/abs/2210.02509v1
- Date: Wed, 5 Oct 2022 18:55:52 GMT
- Title: Revisiting Syllables in Language Modelling and their Application on
Low-Resource Machine Translation
- Authors: Arturo Oncevay, Kervy Dante Rivas Rojas, Liz Karen Chavez Sanchez,
Roberto Zariquiey
- Abstract summary: Syllables provide shorter sequences than characters, require less-specialised extraction rules than morphemes, and their segmentation is not affected by corpus size.
We first explore the potential of syllables for open-vocabulary language modelling in 21 languages.
We use rule-based syllabification methods for six languages and address the rest with hyphenation, which works as a syllabification proxy.
- Score: 1.2617078020344619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language modelling and machine translation tasks mostly use subword or
character inputs, but syllables are seldom used. Syllables provide shorter
sequences than characters, require less-specialised extraction rules than
morphemes, and their segmentation is not impacted by the corpus size. In this
study, we first explore the potential of syllables for open-vocabulary language
modelling in 21 languages. We use rule-based syllabification methods for six
languages and address the rest with hyphenation, which works as a
syllabification proxy. With a comparable perplexity, we show that syllables
outperform characters and other subwords. Moreover, we study the importance of
syllables in neural machine translation for an unrelated, low-resource
language pair (Spanish--Shipibo-Konibo). In pairwise and multilingual systems,
syllables outperform unsupervised subwords and even morphological
segmentation methods when translating into a highly synthetic language with a
transparent orthography (Shipibo-Konibo). Finally, we perform a human
evaluation and discuss limitations and opportunities.
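As a concrete illustration of the hyphenation proxy, here is a minimal sketch assuming the pyphen library (Hunspell hyphenation patterns); the Spanish language code and the "_" word-boundary marker are illustrative choices, not necessarily the paper's exact scheme.

```python
# Minimal sketch: hyphenation as a syllabification proxy.
# Assumes the pyphen library; the "_" word-boundary marker is an
# illustrative choice, not necessarily the paper's scheme.
import pyphen

dic = pyphen.Pyphen(lang="es")  # Spanish hyphenation patterns

def syllable_tokenize(sentence: str, marker: str = "_") -> list[str]:
    """Split each word at hyphenation points; mark word-initial
    units so the sentence can be reassembled after generation."""
    tokens = []
    for word in sentence.split():
        units = dic.inserted(word).split("-")
        tokens.append(marker + units[0])
        tokens.extend(units[1:])
    return tokens

print(syllable_tokenize("una oración de ejemplo"))
# exact splits depend on the hyphenation dictionary
```

Concatenating units and stripping the marker inverts the segmentation, which is what makes such a representation usable for open-vocabulary generation.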
Related papers
- Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili [29.252250069388687]
Tokenization splits words into characters or subwords, creating word embeddings that best represent the structure of the language.
We propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language.
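As a rough, hypothetical illustration (not the paper's tokenizer): Swahili syllables are predominantly open (C)(C)V units, so even a greedy regex approximates the segmentation; syllabic nasals (e.g. the "m" in "mtoto") and loanword codas are ignored here.

```python
import re

VOWELS = "aeiou"
# consonant letters = word characters that are not vowels, digits, or "_"
SYLLABLE = re.compile(rf"[^{VOWELS}\W\d_]*[{VOWELS}]", re.IGNORECASE)

def syllabify_swahili(word: str) -> list[str]:
    """Greedily attach onset consonants to the following vowel."""
    return SYLLABLE.findall(word)

print(syllabify_swahili("kiswahili"))  # ['ki', 'swa', 'hi', 'li']
```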
arXiv Detail & Related papers (2024-03-26T17:26:50Z)
- Design and Implementation of a Tool for Extracting Uzbek Syllables [0.0]
Syllabification is a versatile linguistic tool with applications in linguistic research, language technology, education, and other fields.
We present a comprehensive approach to syllabification for the Uzbek language, including rule-based techniques and machine learning algorithms.
The results of our experiments show that both approaches achieved a high level of accuracy, exceeding 99%.
arXiv Detail & Related papers (2023-12-25T17:46:58Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
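UROMAN itself is a separate transliteration tool; as a rough stand-in for the preprocessing step, the unidecode package illustrates the idea of mapping arbitrary scripts onto one shared Latin-script vocabulary.

```python
# Rough stand-in for UROMAN-style preprocessing (UROMAN is a separate
# tool): map text in any script to an ASCII approximation so a single
# romanized vocabulary can cover scripts a pretrained model never saw.
from unidecode import unidecode

for text in ["Москва", "Ελληνικά", "नमस्ते"]:
    print(text, "->", unidecode(text))
```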
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties for machine translation.
A large number of differently inflected surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representations from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and call each group a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
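A minimal sketch of the grouping step, with random placeholder vectors standing in for representations extracted from a multilingual pretrained model; KMeans and the cluster count are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

langs = ["en", "de", "fr", "es", "ru", "hi", "zh", "ja"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(langs), 768))  # placeholder language vectors

# cluster languages into "representation sprachbunds"
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)
for lang, group in zip(langs, labels):
    print(lang, "-> sprachbund", group)
```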
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align both the lexical and the high-level representations of the two languages.
Previous research has shown that unsupervised translation underperforms when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
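A PyTorch skeleton of that idea (sizes and names are illustrative, and teacher-forced decoder inputs are omitted for brevity): one encoder feeds a reconstruction decoder and a translation decoder, and its per-token states serve as the contextualised embeddings.

```python
import torch
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """One LSTM encoder, two decoders: reconstruct the source and
    translate it; encoder states double as contextualised embeddings."""
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_recon = nn.LSTM(dim, dim, batch_first=True)
        self.dec_trans = nn.LSTM(dim, dim, batch_first=True)
        self.out_recon = nn.Linear(dim, src_vocab)
        self.out_trans = nn.Linear(dim, tgt_vocab)

    def forward(self, src: torch.Tensor):
        enc, state = self.encoder(self.src_emb(src))
        recon, _ = self.dec_recon(enc, state)  # simplified: no target inputs
        trans, _ = self.dec_trans(enc, state)
        return enc, self.out_recon(recon), self.out_trans(trans)

model = TranslateReconstruct(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))            # batch of 2 toy sentences
embeddings, recon_logits, trans_logits = model(src)
print(embeddings.shape)                         # torch.Size([2, 7, 256])
```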
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Syllabification of the Divine Comedy [0.0]
We provide a syllabification algorithm for the Divine Comedy using techniques from probabilistic and constraint programming.
We particularly focus on the synalephe, addressed in terms of the "propensity" of a word to take part in a synalephe with adjacent words.
We jointly provide an online vocabulary containing, for each word, information about its syllabification, the location of the tonic accent, and the aforementioned synalephe propensity.
arXiv Detail & Related papers (2020-10-26T12:14:14Z)
- Revisiting Neural Language Modelling with Syllables [3.198144010381572]
We reconsider syllables for an open-vocabulary generation task in 20 languages.
We use rule-based syllabification methods for five languages and address the rest with a hyphenation tool.
With a comparable perplexity, we show that syllables outperform characters, annotated morphemes and unsupervised subwords.
arXiv Detail & Related papers (2020-10-24T11:44:41Z)
- Self-organizing Pattern in Multilayer Network for Words and Syllables [17.69876273827734]
We propose a new universal law that highlights the equally important role of syllables.
By plotting rank-rank frequency distribution of word and syllable for English and Chinese corpora, visible lines appear and can be fit to a master curve.
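A minimal sketch of the counting behind such a plot: rank-frequency lists for words and for sub-word units, with toy two-character chunks standing in for real syllables.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()

def rank_frequency(units):
    """(rank, frequency) pairs, most frequent unit first."""
    counts = Counter(units)
    return [(rank, freq) for rank, (_, freq) in
            enumerate(counts.most_common(), start=1)]

print(rank_frequency(corpus))  # word distribution
# toy stand-in for syllables: fixed two-character chunks
chunks = [w[i:i + 2] for w in corpus for i in range(0, len(w), 2)]
print(rank_frequency(chunks))
```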
arXiv Detail & Related papers (2020-05-05T12:01:47Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
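A hedged sketch of singular vector CCA on two placeholder views of the same languages: SVD-reduce each view, then read off per-component canonical correlations (data and dimensions are invented).

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
view_a = rng.normal(size=(50, 100))  # e.g. typology-based language vectors
view_b = rng.normal(size=(50, 300))  # e.g. vectors learned by an NMT model

# SVD step: keep the top singular directions of each view
a = TruncatedSVD(n_components=10, random_state=0).fit_transform(view_a)
b = TruncatedSVD(n_components=10, random_state=0).fit_transform(view_b)

# CCA step: per-component correlation between the two views
a_c, b_c = CCA(n_components=5).fit_transform(a, b)
print([round(float(np.corrcoef(a_c[:, i], b_c[:, i])[0, 1]), 3)
       for i in range(5)])
```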
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.