Improving Korean NLP Tasks with Linguistically Informed Subword
Tokenization and Sub-character Decomposition
- URL: http://arxiv.org/abs/2311.03928v1
- Date: Tue, 7 Nov 2023 12:08:21 GMT
- Title: Improving Korean NLP Tasks with Linguistically Informed Subword
Tokenization and Sub-character Decomposition
- Authors: Taehee Jeon, Bongseok Yang, Changhwan Kim, Yoonseob Lim
- Abstract summary: We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair.
Our approach balances linguistic accuracy with computational efficiency in Pre-trained Language Models (PLMs)
Our evaluations show that this technique achieves good performances overall, notably improving results in the syntactic task of NIKL-CoLA.
- Score: 6.767341847275751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a morpheme-aware subword tokenization method that utilizes
sub-character decomposition to address the challenges of applying Byte Pair
Encoding (BPE) to Korean, a language characterized by its rich morphology and
unique writing system. Our approach balances linguistic accuracy with
computational efficiency in Pre-trained Language Models (PLMs). Our evaluations
show that this technique achieves good performances overall, notably improving
results in the syntactic task of NIKL-CoLA. This suggests that integrating
morpheme type information can enhance language models' syntactic and semantic
capabilities, indicating that adopting more linguistic insights can further
improve performance beyond standard morphological analysis.
Related papers
- Linguistically Grounded Analysis of Language Models using Shapley Head Values [2.914115079173979]
We investigate the processing of morphosyntactic phenomena by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs)
Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions are handled.
Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information.
arXiv Detail & Related papers (2024-10-17T09:48:08Z) - Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z) - Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss [9.807885676930308]
We propose an approach to model idiomaticity using a triplet loss that incorporates the asymmetric contribution of components words to an idiomatic meaning for training language models.
Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
arXiv Detail & Related papers (2024-06-21T14:21:41Z) - Explicit Morphological Knowledge Improves Pre-training of Language
Models for Hebrew [19.4968960182412]
We investigate the hypothesis that incorporating explicit morphological knowledge in the pre-training phase can improve the performance of PLMs for morphologically rich languages.
We propose various morphologically driven tokenization methods enabling the model to leverage morphological cues beyond raw text.
Our experiments show that morphologically driven tokenization demonstrates improved results compared to a standard language-agnostic tokenization.
arXiv Detail & Related papers (2023-11-01T17:02:49Z) - Towards preserving word order importance through Forced Invalidation [80.33036864442182]
We show that pre-trained language models are insensitive to word order.
We propose Forced Invalidation to help preserve the importance of word order.
Our experiments demonstrate that Forced Invalidation significantly improves the sensitivity of the models to word order.
arXiv Detail & Related papers (2023-04-11T13:42:10Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Multilingual Extraction and Categorization of Lexical Collocations with
Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z) - Visualizing the Relationship Between Encoded Linguistic Information and
Task Performance [53.223789395577796]
We study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality.
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances.
Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance.
arXiv Detail & Related papers (2022-03-29T19:03:10Z) - Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z) - Multilingual Chart-based Constituency Parse Extraction from Pre-trained
Language Models [21.2879567125422]
We propose a novel method for extracting complete (binary) parses from pre-trained language models.
By applying our method on multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages.
arXiv Detail & Related papers (2020-04-08T05:42:26Z) - On the Importance of Word Order Information in Cross-lingual Sequence
Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.