Unsupervised Morphological Paradigm Completion
- URL: http://arxiv.org/abs/2005.00970v2
- Date: Wed, 20 May 2020 22:56:34 GMT
- Title: Unsupervised Morphological Paradigm Completion
- Authors: Huiming Jin, Liwei Cai, Yihui Peng, Chen Xia, Arya D. McCarthy,
Katharina Kann
- Abstract summary: Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas.
We introduce a system for the task, which generates morphological paradigms via the following steps: (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation.
Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.
- Score: 26.318483685612765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the task of unsupervised morphological paradigm completion. Given
only raw text and a lemma list, the task consists of generating the
morphological paradigms, i.e., all inflected forms, of the lemmas. From a
natural language processing (NLP) perspective, this is a challenging
unsupervised task, and high-performing systems have the potential to improve
tools for low-resource languages or to assist linguistic annotators. From a
cognitive science perspective, this can shed light on how children acquire
morphological knowledge. We further introduce a system for the task, which
generates morphological paradigms via the following steps: (i) EDIT TREE
retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and
(iv) inflection generation. We perform an evaluation on 14 typologically
diverse languages. Our system outperforms trivial baselines with ease and, for
some languages, even obtains a higher accuracy than minimally supervised
systems.
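Step (i), EDIT TREE retrieval, builds on edit trees as used in lemmatization work (e.g., Chrupała-style transformations): a lemma-to-form mapping is encoded by recursively splitting on the longest common substring and storing literal replacements at the leaves, so the same tree can be applied to other lemmas. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function names and the tuple encoding are assumptions made here for clarity.

```python
from difflib import SequenceMatcher

def build_edit_tree(lemma, form):
    """Build an edit tree: split lemma/form on their longest common
    substring and recurse on the remainders.  Leaves store literal
    (old, new) substring replacements."""
    match = SequenceMatcher(None, lemma, form).find_longest_match(
        0, len(lemma), 0, len(form))
    if match.size == 0:
        return ("sub", lemma, form)  # leaf: replace substring wholesale
    left = build_edit_tree(lemma[:match.a], form[:match.b])
    right = build_edit_tree(lemma[match.a + match.size:],
                            form[match.b + match.size:])
    # interior node: (prefix length, suffix length, left tree, right tree)
    return ("match", match.a, len(lemma) - match.a - match.size, left, right)

def apply_edit_tree(tree, lemma):
    """Apply an edit tree to a (possibly unseen) lemma.
    Returns the inflected form, or None if the tree does not fit."""
    if tree[0] == "sub":
        _, old, new = tree
        return new if lemma == old else None
    _, pre, suf, left, right = tree
    if pre + suf > len(lemma):
        return None
    l = apply_edit_tree(left, lemma[:pre])
    r = apply_edit_tree(right, lemma[len(lemma) - suf:])
    if l is None or r is None:
        return None
    return l + lemma[pre:len(lemma) - suf] + r
```

For example, the tree extracted from the pair ("sing", "singing") generalizes to new lemmas ("walk" yields "walking"), while a tree with stem-internal changes, such as one built from ("geben", "gab"), applies only to lemmas it structurally fits and returns None otherwise.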
Related papers
- modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models [23.105555180223487]
modeLing is a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems.
We evaluate several large open source language models and GPT on our benchmark.
arXiv Detail & Related papers (2024-06-24T18:00:59Z)
- Large Language Models for Information Retrieval: A Survey [58.30439850203101]
Information retrieval has evolved from term-based methods to its integration with advanced neural models.
Recent research has sought to leverage large language models (LLMs) to improve IR systems.
We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
- On the Role of Morphological Information for Contextual Lemmatization [7.106986689736827]
We investigate the role of morphological information in developing contextual lemmatizers for six languages: Basque, Turkish, Russian, Czech, Spanish, and English.
Experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology.
arXiv Detail & Related papers (2023-02-01T12:47:09Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- Morphological Processing of Low-Resource Languages: Where We Are and What's Next [23.7371787793763]
We focus on approaches suitable for languages with minimal or no annotated resources.
We argue that the field is ready to tackle the logical next challenge: understanding a language's morphology from raw text alone.
arXiv Detail & Related papers (2022-03-16T19:47:04Z)
- Morphology Without Borders: Clause-Level Morphological Annotation [8.559428282730021]
We propose to view morphology as a clause-level phenomenon, rather than word-level.
We deliver a novel dataset for clause-level morphology covering 4 typologically-different languages: English, German, Turkish and Hebrew.
Our experiments show that the clause-level tasks are substantially harder than the respective word-level tasks, while having comparable complexity across languages.
arXiv Detail & Related papers (2022-02-25T17:20:28Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Information-Theoretic Probing for Linguistic Structure [74.04862204427944]
We propose an information-theoretic operationalization of probing as estimating mutual information.
We evaluate on a set of ten typologically diverse languages often underrepresented in NLP research.
arXiv Detail & Related papers (2020-04-07T01:06:36Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.