Comparison of Turkish Word Representations Trained on Different
Morphological Forms
- URL: http://arxiv.org/abs/2002.05417v1
- Date: Thu, 13 Feb 2020 10:09:31 GMT
- Title: Comparison of Turkish Word Representations Trained on Different
Morphological Forms
- Authors: Gökhan Güler, A. Cüneyd Tantuğ
- Abstract summary: This study prepared texts in morphologically different forms in a morphologically rich language, Turkish.
We trained word2vec models on texts in which lemmas and suffixes are treated differently.
We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language modeling tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increased popularity of different text representations has also brought many
improvements in Natural Language Processing (NLP) tasks. Without the need for
supervised data, embeddings trained on large corpora provide meaningful
relations that can be used in different NLP tasks. Even though training these vectors
is relatively easy with recent methods, information gained from the data
heavily depends on the structure of the corpus language. Since the popularly
researched languages have a similar morphological structure, problems occurring
for morphologically rich languages are mainly disregarded in studies. For
morphologically rich languages, context-free word vectors ignore morphological
structure of languages. In this study, we prepared texts in morphologically
different forms in a morphologically rich language, Turkish, and compared the
results on different intrinsic and extrinsic tasks. To see the effect of
morphological structure, we trained word2vec models on texts in which lemmas and
suffixes are treated differently. We also trained the subword model fastText and
compared the embeddings on word analogy, text classification, sentiment
analysis, and language modeling tasks.
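As an illustration of what "morphologically different forms" of a text could look like, the sketch below builds three variants of one Turkish sentence: the surface forms, lemmas only, and lemmas followed by separated suffix tokens. The three variants, the `TOY_ANALYSES` table, and all function names are assumptions made for this example; the abstract does not specify the exact forms or the analyzer used in the study. In a real pipeline the analyses would come from a Turkish morphological analyzer, and each variant would be written out as a separate training corpus for word2vec.

```python
# Hypothetical, hand-written analyses: surface form -> (lemma, suffix tokens).
# A real pipeline would obtain these from a Turkish morphological analyzer.
TOY_ANALYSES = {
    "evlerde": ("ev", ["+ler", "+de"]),       # "in the houses"
    "kitapları": ("kitap", ["+lar", "+ı"]),   # "the books" (accusative)
    "okudum": ("oku", ["+du", "+m"]),         # "I read" (past tense)
}


def analyze(word):
    """Return (lemma, suffixes); unknown words are kept whole with no suffixes."""
    return TOY_ANALYSES.get(word, (word, []))


def to_lemmas(tokens):
    """Keep only the lemma of each token and drop all suffixes."""
    return [analyze(token)[0] for token in tokens]


def to_lemmas_and_suffixes(tokens):
    """Emit each lemma followed by its suffixes as separate tokens."""
    output = []
    for token in tokens:
        lemma, suffixes = analyze(token)
        output.append(lemma)
        output.extend(suffixes)
    return output


def build_variants(sentence):
    """Build the three illustrative corpus variants for one whitespace-tokenized sentence."""
    tokens = sentence.split()
    return {
        "surface": tokens,
        "lemma": to_lemmas(tokens),
        "lemma_suffix": to_lemmas_and_suffixes(tokens),
    }


if __name__ == "__main__":
    for name, tokens in build_variants("evlerde kitapları okudum").items():
        print(name, tokens)
```

Treating suffixes as separate tokens shrinks the vocabulary and lets rare inflected forms share statistics with their lemma, at the cost of longer token sequences. Which variant works best on a given task is the empirical question the paper investigates.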
Related papers
- Why do language models perform worse for morphologically complex languages? [0.913127392774573]
We find new evidence for a performance gap between agglutinative and fusional languages.
We propose three possible causes for this performance gap: morphological alignment of tokenizers, tokenization quality, and disparities in dataset sizes and measurement.
Results suggest that no language is harder or easier for a language model to learn on the basis of its morphological typology.
arXiv Detail & Related papers (2024-11-21T15:06:51Z)
- UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings [0.0]
Affixes play an important role in the morphological analysis of words, by adding additional meanings and grammatical functions to words.
This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information.
The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
arXiv Detail & Related papers (2024-05-23T05:06:55Z)
- On the Role of Morphological Information for Contextual Lemmatization [7.106986689736827]
We investigate the role of morphological information in developing contextual lemmatizers for six languages: Basque, Turkish, Russian, Czech, Spanish and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
arXiv Detail & Related papers (2023-02-01T12:47:09Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses [62.197912623223964]
We show a low-dimensional structure where language models and translation models smoothly interpolate between word embeddings, syntactic and semantic tasks, and future word embeddings.
We find that this representation embedding can predict how well each individual feature space maps to human brain responses to natural language stimuli recorded using fMRI.
This suggests that the embedding captures some part of the brain's natural language representation structure.
arXiv Detail & Related papers (2021-06-09T22:59:12Z)
- Neural disambiguation of lemma and part of speech in morphologically rich languages [0.6346772579930928]
We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages.
We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser.
arXiv Detail & Related papers (2020-07-12T21:48:52Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.