Evaluation of Morphological Embeddings for the Russian Language
- URL: http://arxiv.org/abs/2103.06628v1
- Date: Thu, 11 Mar 2021 11:59:11 GMT
- Title: Evaluation of Morphological Embeddings for the Russian Language
- Authors: Vitaly Romanov and Albina Khusainova
- Abstract summary: morphology-based embeddings trained with the Skipgram objective do not outperform an existing embedding model, FastText.
A more complex but morphology-unaware model, BERT, achieves significantly greater performance on the tasks that presumably require understanding of a word's morphology.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A number of morphology-based word embedding models were introduced in recent
years. However, their evaluation was mostly limited to English, which is known
to be a morphologically simple language. In this paper, we explore whether and
to what extent incorporating morphology into word embeddings improves
performance on downstream NLP tasks, in the case of the morphologically rich
Russian language. Our NLP tasks of choice are POS tagging, chunking, and NER;
for Russian, all of these can largely be solved using morphology alone, without
understanding the semantics of words. Our experiments show that
morphology-based embeddings trained with the Skipgram objective do not
outperform an existing embedding model, FastText. Moreover, a more complex but
morphology-unaware model, BERT, achieves significantly greater performance on
the tasks that presumably require understanding of a word's morphology.
Related papers
- On the Role of Morphological Information for Contextual Lemmatization [7.106986689736827]
We investigate the role of morphological information to develop contextual lemmatizers in six languages: Basque, Turkish, Russian, Czech, Spanish, and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
arXiv Detail & Related papers (2023-02-01T12:47:09Z) - UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z) - Modeling Target-Side Morphology in Neural Machine Translation: A
Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - Morphology Without Borders: Clause-Level Morphological Annotation [8.559428282730021]
We propose to view morphology as a clause-level phenomenon, rather than word-level.
We deliver a novel dataset for clause-level morphology covering 4 typologically-different languages: English, German, Turkish and Hebrew.
Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages.
arXiv Detail & Related papers (2022-02-25T17:20:28Z) - Morph Call: Probing Morphosyntactic Content of Multilingual Transformers [2.041108289731398]
We present Morph Call, a suite of 46 probing tasks for four Indo-European languages of different morphology: English, French, German and Russian.
We use a combination of neuron-, layer- and representation-level introspection techniques to analyze the morphosyntactic content of four multilingual transformers.
The results show that fine-tuning for POS tagging can both improve and decrease probing performance, and can change how morphosyntactic knowledge is distributed across the model.
arXiv Detail & Related papers (2021-04-26T19:53:00Z) - Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z) - Morphological Disambiguation from Stemming Data [1.2183405753834562]
Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis.
We learn to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing.
Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.
arXiv Detail & Related papers (2020-11-11T01:44:09Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Morphological Word Segmentation on Agglutinative Languages for Neural
Machine Translation [8.87546236839959]
We propose a morphological word segmentation method on the source side for neural machine translation (NMT).
It incorporates morphology knowledge to preserve the linguistic and semantic information in the word structure while reducing the vocabulary size at training time.
It can be utilized as a preprocessing tool to segment the words in agglutinative languages for other natural language processing (NLP) tasks.
arXiv Detail & Related papers (2020-01-02T10:05:02Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z) - Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages together.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.