K-UniMorph: Korean Universal Morphology and its Feature Schema
- URL: http://arxiv.org/abs/2305.06335v3
- Date: Wed, 17 May 2023 08:29:58 GMT
- Title: K-UniMorph: Korean Universal Morphology and its Feature Schema
- Authors: Eunkyul Leah Jo and Kyuwon Kim and Xihan Wu and KyungTae Lim and
Jungyeul Park and Chulwoo Park
- Abstract summary: We present a new Universal Morphology dataset for Korean.
We outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata.
We carry out the inflection task using three different Korean word forms: letters, syllables and morphemes.
- Score: 1.3048920509133806
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present in this work a new Universal Morphology dataset for Korean.
Previously, the Korean language has been underrepresented in the field of
morphological paradigms amongst hundreds of diverse world languages. Hence, we
propose these Universal Morphological paradigms for the Korean language that
preserve its distinct characteristics. For our K-UniMorph dataset, we outline
each grammatical criterion in detail for the verbal endings, clarify how to
extract inflected forms, and demonstrate how we generate the morphological
schemata. This dataset adopts morphological feature schema from Sylak-Glassman
et al. (2015) and Sylak-Glassman (2016) for the Korean language as we extract
inflected verb forms from the Sejong morphologically analyzed corpus, which is
one of the largest annotated corpora for Korean. During the data creation, our
methodology also includes investigating the correctness of the conversion from
the Sejong corpus. Furthermore, we carry out the inflection task using three
different Korean word forms: letters, syllables and morphemes. Finally, we
discuss and describe future perspectives on Korean morphological paradigms and
the dataset.
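The three word-form granularities mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing code. It assumes that "letters" correspond to Unicode conjoining jamo (obtained here by NFD normalization) and that "syllables" correspond to precomposed Hangul blocks (NFC). The function names are hypothetical, and the morpheme segmentation is hand-written for one example word using Sejong-style tags, since producing it automatically would require a morphological analyzer.

```python
import unicodedata
from typing import List


def to_syllables(word: str) -> List[str]:
    """Split a Korean word into precomposed Hangul syllable blocks (NFC)."""
    return list(unicodedata.normalize("NFC", word))


def to_letters(word: str) -> List[str]:
    """Split a Korean word into conjoining jamo via canonical decomposition (NFD)."""
    return list(unicodedata.normalize("NFD", word))


word = "먹었다"  # "ate"

# Hand-written, illustrative morpheme analysis with Sejong-style tags
# (VV = verb, EP = pre-final ending, EF = final ending). This is an
# assumption for demonstration, not output taken from the dataset.
morphemes = ["먹/VV", "었/EP", "다/EF"]

print("syllables:", to_syllables(word))
print("letters:  ", to_letters(word))
print("morphemes:", morphemes)
```

Recomposing the jamo sequence with NFC yields the original word, so the letter-level view loses no information relative to the syllable-level view.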
Related papers
- Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers [7.275938266030414]
Syntactic elements, such as word order and case markers, are fundamental in natural language processing.
This study explores whether Korean language models can accurately capture this flexibility.
arXiv Detail & Related papers (2024-07-12T11:33:41Z)
- Labeled Morphological Segmentation with Semi-Markov Models [127.69031138022534]
We present labeled morphological segmentation, an alternative view of morphological processing that unifies several tasks.
We additionally introduce a new hierarchy of morphotactic tagsets.
We develop CHIPMUNK, a discriminative morphological segmentation system that explicitly models morphotactics.
arXiv Detail & Related papers (2024-04-13T12:51:53Z)
- Word segmentation granularity in Korean [1.0619039878979954]
There are multiple possible levels of word segmentation granularity in Korean.
For specific language processing and corpus annotation tasks, several different granularity levels have been proposed and utilized.
Interestingly, the granularity obtained by separating only functional morphemes yields the best performance for phrase structure parsing.
arXiv Detail & Related papers (2023-09-07T13:42:05Z)
- Korean Named Entity Recognition Based on Language-Specific Features [3.1884260020646265]
We propose a novel way of improving named entity recognition in the Korean language using its language-specific features.
The proposed scheme decomposes Korean words into morphemes and reduces the ambiguity of named entities.
Analyses of the results of statistical and neural models reveal that the proposed morpheme-based format is feasible.
arXiv Detail & Related papers (2023-05-10T17:34:52Z)
- Yet Another Format of Universal Dependencies for Korean [4.909210276089872]
morphUD outperforms parsing results for all Korean UD treebanks.
We develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically.
arXiv Detail & Related papers (2022-09-20T14:21:00Z)
- UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z)
- Quantifying Synthesis and Fusion and their Impact on Machine Translation [79.61874492642691]
Literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative.
In this work, we propose to reduce the rigidity of such claims, by quantifying morphological typology at the word and segment level.
For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study.
Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at word (nouns and verbs for English-Turkish,
arXiv Detail & Related papers (2022-05-06T17:04:58Z)
- Modeling Target-Side Morphology in Neural Machine Translation: A Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large number of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)