On the Role of Morphological Information for Contextual Lemmatization
- URL: http://arxiv.org/abs/2302.00407v3
- Date: Fri, 20 Oct 2023 15:31:57 GMT
- Title: On the Role of Morphological Information for Contextual Lemmatization
- Authors: Olia Toporkov, Rodrigo Agerri
- Abstract summary: We investigate the role of morphological information to develop contextual lemmatizers in six languages: Basque, Turkish, Russian, Czech, Spanish and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
- Score: 7.106986689736827
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Lemmatization is a natural language processing (NLP) task which consists of
producing, from a given inflected word, its canonical form or lemma.
Lemmatization is one of the basic tasks that facilitate downstream NLP
applications, and is of particular importance for highly inflected languages.
Given that the process to obtain a lemma from an inflected word can be
explained by looking at its morphosyntactic category, including fine-grained
morphosyntactic information to train contextual lemmatizers has become common
practice, without considering whether that is the optimum in terms of
downstream performance. In order to address this issue, in this paper we
empirically investigate the role of morphological information to develop
contextual lemmatizers in six languages within a varied spectrum of
morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English.
Furthermore, and unlike the vast majority of previous work, we also evaluate
lemmatizers in out-of-domain settings, which constitutes, after all, their most
common application. The results of our study are rather surprising. It
turns out that providing lemmatizers with fine-grained morphological features
during training is not that beneficial, not even for agglutinative languages.
In fact, modern contextual word representations seem to implicitly encode
enough morphological information to obtain competitive contextual lemmatizers
without seeing any explicit morphological signal. Moreover, our experiments
suggest that the best lemmatizers out-of-domain are those using simple UPOS
tags or those trained without morphology and, finally, that current evaluation
practices for lemmatization are not adequate to clearly discriminate between
models.
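As a toy illustration of why the morphosyntactic category bears on lemmatization (the lookup table and function names below are hypothetical and do not come from the paper), the same surface form can receive different lemmas depending on its UPOS tag; the question the paper studies is whether contextual lemmatizers need this signal explicitly or already encode it:

```python
# Toy illustration, not the paper's system: the morphosyntactic category can
# disambiguate a lemma. English "saw" maps to different lemmas depending on
# whether it is used as a VERB or a NOUN.

# Hypothetical lookup keyed by (surface form, UPOS tag).
LEMMA_BY_FORM_AND_UPOS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("better", "ADJ"): "good",
    ("better", "ADV"): "well",
}

def lemmatize(form: str, upos: str) -> str:
    """Return the lemma for a (form, UPOS) pair, falling back to the form itself."""
    return LEMMA_BY_FORM_AND_UPOS.get((form.lower(), upos), form.lower())

# "I saw a saw": the same surface form receives two different lemmas.
print(lemmatize("saw", "VERB"))  # -> see
print(lemmatize("saw", "NOUN"))  # -> saw
```

A contextual lemmatizer learns this disambiguation from the sentence itself; the paper asks whether feeding it explicit UPOS or fine-grained morphological tags on top of contextual embeddings actually improves downstream performance.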
Related papers
- A Morphology-Based Investigation of Positional Encodings [46.667985003225496]
Morphology and word order are closely linked, with the latter incorporated into transformer-based models through positional encodings.
This prompts a fundamental inquiry: Is there a correlation between the morphological complexity of a language and the utilization of positional encoding in pre-trained language models?
In pursuit of an answer, we present the first study addressing this question, encompassing 22 languages and 5 downstream tasks.
arXiv Detail & Related papers (2024-04-06T07:10:47Z) - Evaluating Shortest Edit Script Methods for Contextual Lemmatization [6.0158981171030685]
Modern contextual lemmatizers often rely on automatically induced Shortest Edit Scripts (SES) to transform a word form into its lemma.
Previous work has not investigated the direct impact of SES on final lemmatization performance.
We show that computing the casing and edit operations separately is beneficial overall, and especially so for languages with highly inflected morphology; a sketch of the idea follows below.
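As a rough sketch of the idea (not the paper's implementation; the function below is hypothetical), a shortest edit script can be derived with Python's standard difflib, with casing handled separately from the character-level edits:

```python
import difflib

def shortest_edit_script(form: str, lemma: str):
    """Toy SES: derive edit operations that turn a word form into its lemma.
    Casing is recorded separately from the character edits, in the spirit of
    computing casing and edit operations independently."""
    casing_op = "lower" if form[:1].isupper() and not lemma[:1].isupper() else "keep"
    ops = []
    matcher = difflib.SequenceMatcher(a=form.lower(), b=lemma.lower())
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            ops.append((tag, i1, i2, lemma.lower()[j1:j2]))
    return casing_op, ops

print(shortest_edit_script("Studied", "study"))
# -> ('lower', [('replace', 4, 7, 'y')])
```

The class of such (casing, edit) pairs can then serve as the label set for a token-level classifier, turning lemmatization into sequence labeling.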
arXiv Detail & Related papers (2024-03-25T17:28:24Z) - Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon -- referential opacity -- add to the growing body of evidence that current language models do not well-represent natural language semantics.
arXiv Detail & Related papers (2022-10-14T02:35:19Z) - Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - Morphology Without Borders: Clause-Level Morphological Annotation [8.559428282730021]
We propose to view morphology as a clause-level phenomenon, rather than word-level.
We deliver a novel dataset for clause-level morphology covering 4 typologically-different languages: English, German, Turkish and Hebrew.
Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages.
arXiv Detail & Related papers (2022-02-25T17:20:28Z) - A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z) - Evaluation of Morphological Embeddings for the Russian Language [0.0]
Morphology-based embeddings trained with the Skipgram objective do not outperform the existing embedding model FastText.
A more complex but morphology-unaware model, BERT, achieves significantly greater performance on tasks that presumably require understanding of a word's morphology.
arXiv Detail & Related papers (2021-03-11T11:59:11Z) - Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
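A minimal sketch of that contrast, using synthetic vectors and a plain logistic-regression probe rather than the paper's latent-variable formulation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy contrast between extrinsic and intrinsic probing on synthetic "embeddings".
# Assumption: 200 vectors of dimension 32, where a binary morphological property
# (say, plural vs. singular) is planted mostly in dimension 5.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
embeddings = rng.normal(size=(200, 32))
embeddings[:, 5] += 2.0 * labels  # plant the signal in one dimension

# Extrinsic probing: show the property is extractable by training a classifier.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("probe accuracy:", probe.score(embeddings, labels))

# A crude intrinsic view: ask *where* the property lives by ranking dimensions
# by the magnitude of the probe's weights (dimension 5 should dominate).
top_dims = np.argsort(np.abs(probe.coef_[0]))[::-1][:3]
print("most informative dimensions:", top_dims)
```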
arXiv Detail & Related papers (2020-10-06T15:21:08Z) - A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
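A schematic sketch of such a joint setup, assuming a generic shared encoder with two prediction heads rather than the paper's exact architecture (all class and parameter names below are hypothetical):

```python
import torch
import torch.nn as nn

class JointLemmaMorphTagger(nn.Module):
    """Schematic sketch: a shared sentence encoder with two heads, one
    predicting a lemma edit-script class per token and one predicting a
    morphological tag per token."""

    def __init__(self, vocab_size, n_edit_scripts, n_morph_tags, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.lemma_head = nn.Linear(2 * dim, n_edit_scripts)
        self.morph_head = nn.Linear(2 * dim, n_morph_tags)

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.lemma_head(hidden), self.morph_head(hidden)

# Joint training would sum the two cross-entropy losses over each batch.
model = JointLemmaMorphTagger(vocab_size=1000, n_edit_scripts=50, n_morph_tags=30)
tokens = torch.randint(0, 1000, (2, 7))        # batch of 2 sentences, 7 tokens
lemma_logits, morph_logits = model(tokens)
print(lemma_logits.shape, morph_logits.shape)  # (2, 7, 50) and (2, 7, 30)
```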
arXiv Detail & Related papers (2019-04-04T02:03:19Z)