Yet Another Format of Universal Dependencies for Korean
- URL: http://arxiv.org/abs/2209.09742v1
- Date: Tue, 20 Sep 2022 14:21:00 GMT
- Title: Yet Another Format of Universal Dependencies for Korean
- Authors: Yige Chen, Eunkyul Leah Jo, Yundong Yao, KyungTae Lim, Miikka Silfverberg, Francis M. Tyers, Jungyeul Park
- Abstract summary: morphUD outperforms parsing results for all Korean UD treebanks.
We develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically.
- Score: 4.909210276089872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we propose a morpheme-based scheme for Korean dependency
parsing and adopt the proposed scheme to Universal Dependencies. We present the
linguistic rationale that illustrates the motivation and the necessity of
adopting the morpheme-based format, and develop scripts that convert between
the original format used by Universal Dependencies and the proposed
morpheme-based format automatically. The effectiveness of the proposed format
for Korean dependency parsing is then testified by both statistical and neural
models, including UDPipe and Stanza, with our carefully constructed
morpheme-based word embedding for Korean. morphUD outperforms parsing results
for all Korean UD treebanks, and we also present detailed error analyses.
Related papers
- Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers [7.275938266030414]
Syntactic elements, such as word order and case markers, are fundamental in natural language processing.
This study explores whether Korean language models can accurately capture the word-order flexibility that Korean case markers allow.
arXiv Detail & Related papers (2024-07-12T11:33:41Z)
- Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition [6.767341847275751]
We introduce a morpheme-aware subword tokenization method that uses sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE).
Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs).
Our evaluations show that this technique performs well overall, notably improving results on the syntactic task of NIKL-CoLA.
arXiv Detail & Related papers (2023-11-07T12:08:21Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- K-UniMorph: Korean Universal Morphology and its Feature Schema [1.3048920509133806]
We present a new Universal Morphology dataset for Korean.
We outline each grammatical criterion in detail for the verbal endings, clarify how to extract inflected forms, and demonstrate how we generate the morphological schemata.
We carry out the inflection task using three different Korean word forms: letters, syllables and morphemes.
arXiv Detail & Related papers (2023-05-10T17:44:01Z)
- Korean Named Entity Recognition Based on Language-Specific Features [3.1884260020646265]
We propose a novel way of improving named entity recognition in the Korean language using its language-specific features.
The proposed scheme decomposes Korean words into morphemes and reduces the ambiguity of named entities.
Analyses of the results of statistical and neural models reveal that the proposed morpheme-based format is feasible.
arXiv Detail & Related papers (2023-05-10T17:34:52Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- UniMorph 4.0: Universal Morphology [104.69846084893298]
This paper presents the expansions and improvements made on several fronts over the last couple of years.
Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages.
In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages.
arXiv Detail & Related papers (2022-05-07T09:19:02Z)
- Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z)
- Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
- Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean [15.899449418195106]
We first raise important issues regarding the Penn Korean Universal Treebank (PKT-UD).
We address these issues by revising the entire corpus manually with the aim of producing cleaner UD annotations.
For compatibility with the rest of the UD corpora, we extensively revise the part-of-speech tags and the dependency relations.
arXiv Detail & Related papers (2020-05-26T17:46:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.