Neural disambiguation of lemma and part of speech in morphologically
rich languages
- URL: http://arxiv.org/abs/2007.06104v1
- Date: Sun, 12 Jul 2020 21:48:52 GMT
- Title: Neural disambiguation of lemma and part of speech in morphologically
rich languages
- Authors: José María Hoya Quecedo, Maximilian W. Koppatz, Giacomo Furlan,
Roman Yangarber
- Abstract summary: We consider the problem of disambiguating the lemma and part of speech of ambiguous words in morphologically rich languages.
We propose a method for disambiguating ambiguous words in context, using a large un-annotated corpus of text, and a morphological analyser.
- Score: 0.6346772579930928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of disambiguating the lemma and part of speech of
ambiguous words in morphologically rich languages. We propose a method for
disambiguating ambiguous words in context, using a large un-annotated corpus of
text, and a morphological analyser -- with no manual disambiguation or data
annotation. We assume that the morphological analyser produces multiple
analyses for ambiguous words. The idea is to train recurrent neural networks on
the output that the morphological analyser produces for unambiguous words. We
present performance on POS and lemma disambiguation that reaches or surpasses
the state of the art -- including supervised models -- using no manually
annotated data. We evaluate the method on several morphologically rich
languages.
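The training idea in the abstract (learn only from words for which the analyser returns a single analysis, then use what was learned to choose among the analyses of ambiguous words) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `ANALYSER` dictionary, the tiny `CORPUS`, and the function names are hypothetical, and a simple count-based context scorer stands in for the recurrent neural networks the paper describes. English is used only for readability.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy analyser: surface form -> list of (lemma, POS) analyses.
# A real morphological analyser would replace this dictionary.
ANALYSER = {
    "the": [("the", "DET")],
    "dog": [("dog", "NOUN")],
    "dogs": [("dog", "NOUN")],
    "cat": [("cat", "NOUN")],
    "sleeps": [("sleep", "VERB")],
    "runs": [("run", "VERB")],
    "chase": [("chase", "VERB")],
    "chases": [("chase", "VERB")],
    "saw": [("see", "VERB"), ("saw", "NOUN")],  # ambiguous form
}

# Un-annotated text: no manual tags, only tokenised sentences.
CORPUS = [
    ["the", "dog", "sleeps"],
    ["the", "cat", "runs"],
    ["the", "dog", "chases", "the", "cat"],
    ["dogs", "chase", "the", "cat"],
]


def analyse(word):
    """Return all (lemma, POS) analyses the toy analyser knows for a word."""
    return ANALYSER.get(word, [])


def context(sentence, i):
    """Context features for position i: the previous and next surface forms."""
    prev_w = sentence[i - 1] if i > 0 else "<s>"
    next_w = sentence[i + 1] if i + 1 < len(sentence) else "</s>"
    return [("prev", prev_w), ("next", next_w)]


def train(corpus):
    """Count POS labels per context feature, using unambiguous words only."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        for i, word in enumerate(sentence):
            analyses = analyse(word)
            if len(analyses) != 1:
                continue  # ambiguous or unknown words give no training signal
            _, pos = analyses[0]
            for feat in context(sentence, i):
                counts[feat][pos] += 1
    return counts


def disambiguate(sentence, i, counts):
    """Pick the candidate analysis whose POS best fits the context."""
    candidates = analyse(sentence[i])
    if not candidates:
        return None

    def score(analysis):
        _, pos = analysis
        # Add-one smoothing keeps the log defined for unseen pairs.
        return sum(math.log(counts[feat][pos] + 1)
                   for feat in context(sentence, i))

    return max(candidates, key=score)


counts = train(CORPUS)
print(disambiguate(["the", "saw", "runs"], 1, counts))
print(disambiguate(["dogs", "saw", "the", "cat"], 1, counts))
```

The scorer only ranks the analyses that the analyser itself proposes, so it never has to generate a lemma. In this sketch the lemma simply follows from the chosen POS.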
Related papers
- UzMorphAnalyser: A Morphological Analysis Model for the Uzbek Language Using Inflectional Endings [0.0]
Affixes play an important role in the morphological analysis of words, by adding additional meanings and grammatical functions to words.
This paper presents a model for the morphological analysis of Uzbek words, including stemming, lemmatizing, and the extraction of morphological information.
The developed tool based on the proposed model is available as a web-based application and an open-source Python library.
arXiv Detail & Related papers (2024-05-23T05:06:55Z)
- On the Role of Morphological Information for Contextual Lemmatization [7.106986689736827]
We investigate the role of morphological information in developing contextual lemmatizers for six languages: Basque, Turkish, Russian, Czech, Spanish and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
arXiv Detail & Related papers (2023-02-01T12:47:09Z)
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- Testing the Ability of Language Models to Interpret Figurative Language [69.59943454934799]
Figurative and metaphorical language are commonplace in discourse.
It remains an open question to what extent modern language models can interpret nonliteral phrases.
We introduce Fig-QA, a Winograd-style nonliteral language understanding task.
arXiv Detail & Related papers (2022-04-26T23:42:22Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties for machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- How Suitable Are Subword Segmentation Strategies for Translating Non-Concatenative Morphology? [26.71325671956197]
We design a test suite to evaluate segmentation strategies on different types of morphological phenomena.
We find that learning to analyse and generate morphologically complex surface representations is still challenging.
arXiv Detail & Related papers (2021-09-02T17:23:21Z)
- Morphological Disambiguation from Stemming Data [1.2183405753834562]
Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis.
We learn to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing.
Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.
arXiv Detail & Related papers (2020-11-11T01:44:09Z)
- Mechanisms for Handling Nested Dependencies in Neural-Network Language Models and Humans [75.15855405318855]
We studied whether a modern artificial neural network trained with "deep learning" methods mimics a central aspect of human sentence processing.
Although the network was solely trained to predict the next word in a large corpus, analysis showed the emergence of specialized units that successfully handled local and long-distance syntactic agreement.
We tested the model's predictions in a behavioral experiment where humans detected violations in number agreement in sentences with systematic variations in the singular/plural status of multiple nouns.
arXiv Detail & Related papers (2020-06-19T12:00:05Z)
- Evaluating Transformer-Based Multilingual Text Classification [55.53547556060537]
We argue that NLP tools perform unequally across languages with different syntactic and morphological structures.
We calculate word order and morphological similarity indices to aid our empirical study.
arXiv Detail & Related papers (2020-04-29T03:34:53Z)
- Comparison of Turkish Word Representations Trained on Different Morphological Forms [0.0]
This study prepared texts in morphologically different forms in a morphologically rich language, Turkish.
We trained a word2vec model on texts in which lemmas and suffixes are treated differently.
We also trained the subword model fastText and compared the embeddings on word analogy, text classification, sentiment analysis, and language model tasks.
arXiv Detail & Related papers (2020-02-13T10:09:31Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.