Korean Named Entity Recognition Based on Language-Specific Features
- URL: http://arxiv.org/abs/2305.06330v1
- Date: Wed, 10 May 2023 17:34:52 GMT
- Title: Korean Named Entity Recognition Based on Language-Specific Features
- Authors: Yige Chen and KyungTae Lim and Jungyeul Park
- Abstract summary: We propose a novel way of improving named entity recognition in the Korean language using its language-specific features.
The proposed scheme decomposes Korean words into morphemes and reduces the ambiguity of named entities.
Analyses of the results of statistical and neural models reveal that the proposed morpheme-based format is feasible.
- Score: 3.1884260020646265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a novel way of improving named entity
recognition in the Korean language using its language-specific features.
While the field of named entity recognition has been studied extensively in
recent years, the mechanism of efficiently recognizing named entities in
Korean has hardly been explored. This is because the Korean language has
distinct linguistic properties that prevent models from achieving their best
performances. We therefore propose an annotation scheme for Korean corpora
that adopts the CoNLL-U format: it decomposes Korean words into morphemes
and thereby reduces the ambiguity of named entities in the original
segmentation, which may contain functional morphemes such as postpositions
and particles. We investigate how named entity tags are best represented in
this morpheme-based scheme and implement an algorithm that converts
word-based and syllable-based Korean corpora with named entities into the
proposed morpheme-based format. Analyses of the results of statistical and
neural models reveal that the proposed morpheme-based format is feasible,
and we demonstrate how the models' performances vary under the influence of
various additional language-specific features. Extrinsic conditions were
also considered to observe the variance in the performances of the proposed
models, given different types of data, including the original segmentation
and different tagging formats.
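To make the word-to-morpheme conversion concrete, here is a minimal,
hypothetical sketch of the kind of tag projection the abstract describes:
word-level BIO named-entity tags are redistributed over each word's
morphemes so that functional morphemes such as postpositions fall outside
the entity span. The morpheme analyses and the functional-POS set are
hardcoded placeholders (a real pipeline would use a morphological analyzer),
and this is not the authors' exact algorithm.

```python
# Hypothetical sketch: project word-level BIO entity tags onto morphemes.
# The morpheme segmentation below is hardcoded for illustration; a real
# pipeline would obtain it from a Korean morphological analyzer.

FUNCTIONAL_POS = {"JKS", "JKB", "JKO", "JX"}  # particles (Sejong-style tags)

def project_tags(words, word_tags, analyses):
    """Convert word-level BIO tags to morpheme-level BIO tags.

    words:      list of surface words (eojeols)
    word_tags:  BIO tag per word, e.g. "B-LOC", "O"
    analyses:   per word, a list of (morpheme, pos) pairs
    """
    morphemes, tags = [], []
    for word, tag, morphs in zip(words, word_tags, analyses):
        first_content = True
        for morpheme, pos in morphs:
            morphemes.append(morpheme)
            if tag == "O" or pos in FUNCTIONAL_POS:
                # Functional morphemes are excluded from the entity span.
                tags.append("O")
            elif tag.startswith("B-") and first_content:
                tags.append(tag)
                first_content = False
            else:
                tags.append("I-" + tag.split("-", 1)[1])
    return morphemes, tags

# "뉴욕에서" (in New York): content morpheme 뉴욕 + locative postposition 에서.
words = ["뉴욕에서", "왔다"]
word_tags = ["B-LOC", "O"]
analyses = [[("뉴욕", "NNP"), ("에서", "JKB")], [("오", "VV"), ("았다", "EP+EF")]]
print(project_tags(words, word_tags, analyses))
# (['뉴욕', '에서', '오', '았다'], ['B-LOC', 'O', 'O', 'O'])
```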
Related papers
- Does Incomplete Syntax Influence Korean Language Model? Focusing on Word Order and Case Markers [7.275938266030414]
Syntactic elements, such as word order and case markers, are fundamental in natural language processing.
This study explores whether Korean language models can accurately capture the flexibility of Korean word order and case marker usage.
arXiv Detail & Related papers (2024-07-12T11:33:41Z)
- Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words.
Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named Overriding to ensure the indecomposability of morphemes.
Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.
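As a rough illustration of top-down vocabulary matching (not the paper's
actual model, which first induces a morphological tree with a neural
encoder), the hypothetical sketch below recursively splits a word at the
longest in-vocabulary span, keeping larger units intact before descending
into the remainder.

```python
# Hypothetical sketch of top-down tokenization by vocabulary matching.
# The induced morphological structure is approximated here by greedy
# longest-match recursion; the paper induces it with a neural model.

def topdown_tokenize(word, vocab):
    """Recursively split `word`, preferring the longest in-vocabulary span."""
    if not word:
        return []
    if word in vocab:
        return [word]
    # Find the longest substring of `word` that is in the vocabulary.
    best = None
    for length in range(len(word) - 1, 0, -1):
        for start in range(len(word) - length + 1):
            if word[start:start + length] in vocab:
                best = (start, start + length)
                break
        if best:
            break
    if best is None:
        return list(word)  # fall back to characters
    i, j = best
    return (topdown_tokenize(word[:i], vocab) + [word[i:j]]
            + topdown_tokenize(word[j:], vocab))

vocab = {"token", "tokenize", "ize", "r"}
print(topdown_tokenize("tokenizer", vocab))  # ['tokenize', 'r']
```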
arXiv Detail & Related papers (2024-06-21T15:35:49Z)
- Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition [6.767341847275751]
We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean.
Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs).
Our evaluations show that this technique achieves good overall performance, notably improving results on the syntactic task NIKL-CoLA.
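Sub-character decomposition in Korean usually means splitting each Hangul
syllable block into its jamo (initial consonant, vowel, optional final
consonant), which is pure Unicode arithmetic. The sketch below shows only
this decomposition step, not the paper's full tokenizer.

```python
# Decompose Hangul syllables into jamo (sub-characters) via Unicode
# arithmetic. Precomposed syllables occupy U+AC00..U+D7A3, arranged as
# code = 0xAC00 + (initial * 21 + vowel) * 28 + final.

INITIALS = [chr(0x1100 + i) for i in range(19)]         # choseong
VOWELS   = [chr(0x1161 + i) for i in range(21)]         # jungseong
FINALS   = [""] + [chr(0x11A8 + i) for i in range(27)]  # jongseong (optional)

def to_jamo(text):
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                 # a precomposed Hangul syllable
            initial, rest = divmod(code, 588)  # 588 = 21 * 28
            vowel, final = divmod(rest, 28)
            out.append(INITIALS[initial] + VOWELS[vowel] + FINALS[final])
        else:
            out.append(ch)                    # pass non-Hangul through
    return out

print(to_jamo("한국어"))  # 한 → ㅎㅏㄴ, 국 → ㄱㅜㄱ, 어 → ㅇㅓ (conjoining jamo)
```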
arXiv Detail & Related papers (2023-11-07T12:08:21Z)
- Word segmentation granularity in Korean [1.0619039878979954]
There are multiple possible levels of word segmentation granularity in Korean.
For specific language processing and corpus annotation tasks, several different granularity levels have been proposed and utilized.
Interestingly, the granularity obtained by separating only functional morphemes yields the best performance for phrase-structure parsing.
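To illustrate what the granularity levels differ on, here is a hypothetical
example with a hardcoded morpheme analysis (a real system would use a
morphological analyzer): one level keeps the eojeol intact, one detaches
only functional morphemes, and one splits all morphemes.

```python
# Hypothetical illustration of segmentation granularity levels in Korean.
# Morpheme analyses are hardcoded; real pipelines use a morphological analyzer.

FUNCTIONAL_POS = {"JX", "JKO", "JKB", "EF"}  # particles/endings (Sejong-style)

def segment(analysis, granularity):
    """analysis: list of (morpheme, pos); returns a list of tokens."""
    if granularity == "eojeol":            # whole word, no segmentation
        return ["".join(m for m, _ in analysis)]
    if granularity == "functional":        # detach only functional morphemes
        tokens, buf = [], ""
        for morpheme, pos in analysis:
            if pos in FUNCTIONAL_POS:
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(morpheme)
            else:
                buf += morpheme
        if buf:
            tokens.append(buf)
        return tokens
    return [m for m, _ in analysis]        # full morpheme segmentation

# 한국어는 = 한국 (Korea, NNP) + 어 (language suffix, XSN) + 는 (topic particle, JX)
analysis = [("한국", "NNP"), ("어", "XSN"), ("는", "JX")]
for level in ("eojeol", "functional", "morpheme"):
    print(level, segment(analysis, level))
# eojeol ['한국어는'] / functional ['한국어', '는'] / morpheme ['한국', '어', '는']
```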
arXiv Detail & Related papers (2023-09-07T13:42:05Z)
- Yet Another Format of Universal Dependencies for Korean [4.909210276089872]
The proposed morpheme-based format, morphUD, yields improved parsing results for all Korean UD treebanks.
We develop scripts that convert between the original format used by Universal Dependencies and the proposed morpheme-based format automatically.
arXiv Detail & Related papers (2022-09-20T14:21:00Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
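One generic way to explicitly encode syntactic dependencies in a transformer
(illustrative only; not necessarily this paper's exact architecture) is to
bias attention scores toward token pairs linked by a dependency edge. A
minimal numpy sketch of one such attention head follows.

```python
import numpy as np

# Minimal sketch of graph-aware self-attention: attention logits get an
# additive bias wherever two tokens are linked by a dependency edge.
# This is a generic formulation, not the exact architecture of the paper.

def graph_aware_attention(x, adj, bias=2.0):
    """x: (n, d) token representations; adj: (n, n) 0/1 dependency adjacency."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)         # plain dot-product attention logits
    scores = scores + bias * adj          # boost syntactically linked pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x                    # attention-weighted token mixing

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))               # 4 tokens, 8-dim representations
adj = np.zeros((4, 4))
adj[0, 1] = adj[1, 0] = 1                 # e.g., token 0 modifies token 1
print(graph_aware_attention(x, adj).shape)  # (4, 8)
```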
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present the Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
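Bitext retrieval as an alignment measure typically means embedding both
sides of a parallel corpus and checking how often each sentence's nearest
neighbor in the other language is its true translation. A small
cosine-similarity sketch, with random stand-in embeddings, follows.

```python
import numpy as np

# Sketch of a bitext-retrieval alignment score: the fraction of source
# sentences whose nearest target embedding (by cosine similarity) is the
# gold translation. Embeddings are random stand-ins for a real encoder.

def retrieval_accuracy(src, tgt):
    """src, tgt: (n, d) sentence embeddings; row i of tgt translates row i of src."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                    # (n, n) cosine similarities
    predictions = sims.argmax(axis=1)     # nearest target per source sentence
    return (predictions == np.arange(len(src))).mean()

rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64))
tgt = src + 0.1 * rng.normal(size=(100, 64))  # noisy "translations"
print(retrieval_accuracy(src, tgt))           # close to 1.0 for low noise
```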
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
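Collocation tokenization here means merging frequent word pairs (e.g.,
"new york" into "new_york") into single tokens before topic modeling. A
minimal PMI-style bigram merger, written from scratch rather than with any
particular library and with illustrative thresholds, is sketched below.

```python
import math
from collections import Counter

# Sketch of collocation merging before topic modeling: frequent bigrams with
# high pointwise mutual information (PMI) become single tokens. The
# thresholds are illustrative; real pipelines tune them on corpus statistics.

def merge_collocations(docs, min_count=2, pmi_threshold=1.0):
    unigrams = Counter(w for doc in docs for w in doc)
    bigrams = Counter(b for doc in docs for b in zip(doc, doc[1:]))
    total = sum(unigrams.values())
    good = set()
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue
        pmi = math.log((n / total) / ((unigrams[a] / total) * (unigrams[b] / total)))
        if pmi >= pmi_threshold:
            good.add((a, b))
    merged_docs = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in good:
                out.append(doc[i] + "_" + doc[i + 1])  # merge the collocation
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged_docs.append(out)
    return merged_docs

docs = [["new", "york", "is", "big"], ["i", "love", "new", "york"]]
print(merge_collocations(docs))
# [['new_york', 'is', 'big'], ['i', 'love', 'new_york']]
```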
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed thereafter, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
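The idea of a compositional output layer, sketched generically below, is to
compute each candidate word's output embedding from its characters on the
fly, so the output matrix no longer stores one row per vocabulary word. The
composition function here (summed character embeddings plus a linear map) is
a placeholder, not the paper's exact architecture.

```python
import numpy as np

# Generic sketch of a compositional output embedding layer: a word's output
# embedding is composed from character embeddings, so model size does not
# grow with the training vocabulary. The composition function is a stand-in.

rng = np.random.default_rng(0)
CHAR_DIM, HIDDEN = 16, 32
char_emb = {c: rng.normal(size=CHAR_DIM) for c in "abcdefghijklmnopqrstuvwxyz"}
W = rng.normal(size=(CHAR_DIM, HIDDEN))   # composition: chars -> output space

def word_output_embedding(word):
    """Compose an output embedding from character embeddings (placeholder)."""
    summed = sum(char_emb[c] for c in word)
    return np.tanh(summed @ W)

def next_word_logits(hidden_state, candidate_words):
    """Score any candidate words, even ones unseen in training."""
    E = np.stack([word_output_embedding(w) for w in candidate_words])
    return E @ hidden_state               # one logit per candidate word

h = rng.normal(size=HIDDEN)
print(next_word_logits(h, ["cat", "dog", "zyzzyva"]))  # works for novel words
```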
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language too closely may fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
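One simple strategy for reducing sensitivity to source-language word order
(an illustrative option; the paper studies order-reduced modeling more
systematically) is to randomly shuffle the source-language training
sentences while keeping each label attached to its token. The sketch below
shuffles BIO spans as units so entity annotations stay contiguous.

```python
import random

# Illustrative order-insensitivity strategy: shuffle source-language training
# sentences, keeping labels attached and BIO entity spans intact.

def spans(tokens, labels):
    """Group tokens into contiguous chunks so BIO spans stay intact."""
    chunks, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("I-") and current:
            current.append((tok, lab))    # continue the open entity span
        else:
            if current:
                chunks.append(current)
            current = [(tok, lab)]
    if current:
        chunks.append(current)
    return chunks

def shuffle_example(tokens, labels, rng):
    chunks = spans(tokens, labels)
    rng.shuffle(chunks)                   # permute spans, not single tokens
    flat = [pair for chunk in chunks for pair in chunk]
    return [t for t, _ in flat], [l for _, l in flat]

rng = random.Random(0)
tokens = ["John", "lives", "in", "New", "York"]
labels = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
print(shuffle_example(tokens, labels, rng))  # "New York" stays together
```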
arXiv Detail & Related papers (2020-01-30T03:35:44Z)