Morphology Matters: A Multilingual Language Modeling Analysis
- URL: http://arxiv.org/abs/2012.06262v1
- Date: Fri, 11 Dec 2020 11:55:55 GMT
- Title: Morphology Matters: A Multilingual Language Modeling Analysis
- Authors: Hyunji Hayley Park, Katherine J. Zhang, Coleman Haley, Kenneth Steimel, Han Liu, Lane Schwartz
- Abstract summary: Prior studies disagree on whether inflectional morphology makes languages harder to model.
We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features.
Several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data.
- Score: 8.791030561752384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior studies in multilingual language modeling (e.g., Cotterell et al.,
2018; Mielke et al., 2019) disagree on whether or not inflectional morphology
makes languages harder to model. We attempt to resolve the disagreement and
extend those studies. We compile a larger corpus of 145 Bible translations in
92 languages and a larger number of typological features. We fill in missing
typological data for several languages and consider corpus-based measures of
morphological complexity in addition to expert-produced typological features.
We find that several morphological measures are significantly associated with
higher surprisal when LSTM models are trained with BPE-segmented data. We also
investigate linguistically-motivated subword segmentation strategies like
Morfessor and Finite-State Transducers (FSTs) and find that these segmentation
strategies yield better performance and reduce the impact of a language's
morphology on language modeling.
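As a rough illustration of the surprisal quantity the abstract refers to, the sketch below sums the negative log-probabilities a model assigns to the tokens of one text. It is not the authors' code: the function name and the per-token probabilities are made up for illustration, and the comparison of totals (rather than per-token averages) is an assumption, chosen here because different segmentations split the same text into different numbers of tokens.

```python
import math


def total_surprisal_bits(token_probs):
    """Return the summed surprisal, in bits, of one tokenized text.

    token_probs holds the probability the model assigned to each token
    given its left context (hypothetical values in this sketch).
    """
    return sum(-math.log2(p) for p in token_probs)


# Hypothetical probabilities for the same verse under two segmentations.
bpe_probs = [0.5, 0.25, 0.25, 0.5]   # verse split into 4 subword tokens
morph_probs = [0.5, 0.5, 0.25]       # verse split into 3 morpheme-like tokens

bpe_bits = total_surprisal_bits(bpe_probs)
morph_bits = total_surprisal_bits(morph_probs)

# Lower total surprisal means the text was easier for the model to predict.
print(f"BPE: {bpe_bits:.1f} bits, morphological: {morph_bits:.1f} bits")
```

Because both totals describe the same underlying text, they can be compared directly even though the token counts differ; dividing by the number of tokens would not allow that.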
Related papers
- UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings [0.0]
Affixes play an important role in the morphological analysis of words, by adding additional meanings and grammatical functions to words.
This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information.
The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
arXiv Detail & Related papers (2024-05-23T05:06:55Z)
- Explicit Morphological Knowledge Improves Pre-training of Language Models for Hebrew [19.4968960182412]
We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for morphologically rich languages.
We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text.
Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization.
arXiv Detail & Related papers (2023-11-01T17:02:49Z)
- UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z)
- Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models [84.86942006830772]
We conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar.
We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe.
arXiv Detail & Related papers (2022-05-04T12:22:31Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? [26.71325671956197]
We design a test suite to evaluate segmentation strategies on different types of morphological phenomena.
We find that learning to analyse and generate morphologically complex surface representations is still challenging.
arXiv Detail & Related papers (2021-09-02T17:23:21Z)
- Linguistic Typology Features from Text: Inferring the Sparse Features of World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Comparison of Turkish Word Representations Trained on Different Morphological Forms [0.0]
This study prepared texts in morphologically different forms in a morphologically rich language, Turkish.
We trained a word2vec model on texts in which lemmas and suffixes are treated differently.
We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
arXiv Detail & Related papers (2020-02-13T10:09:31Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.