Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement
- URL: http://arxiv.org/abs/2503.14718v1
- Date: Tue, 18 Mar 2025 20:42:42 GMT
- Title: Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement
- Authors: Hakyung Sung, Gyu-Ho Shin
- Abstract summary: We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. We fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.
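The central experimental step here is fine-tuning general-purpose, first-language-based Korean language models on the L2-Korean UD data for morphosyntactic analysis. The sketch below is a rough illustration of what such a setup can look like, not the authors' actual pipeline: it fine-tunes a Korean encoder for UPOS tagging on a UD-style CoNLL-U file, and the file path, the klue/bert-base checkpoint, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): fine-tune a Korean encoder for UPOS
# tagging on a UD-style CoNLL-U file. Paths, model choice, and hyperparameters
# are assumptions for illustration only.
import torch
from torch.utils.data import DataLoader
from conllu import parse_incr                      # pip install conllu
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "klue/bert-base"                      # any Korean encoder works here
TRAIN_FILE = "l2ko-ud-train.conllu"                # hypothetical path

# 1. Read (token, UPOS) pairs from the treebank.
sentences = []
with open(TRAIN_FILE, encoding="utf-8") as f:
    for sent in parse_incr(f):
        toks = [t for t in sent if isinstance(t["id"], int)]  # skip multiword ranges
        sentences.append(([t["form"] for t in toks], [t["upos"] for t in toks]))

tags = sorted({u for _, upos in sentences for u in upos})
tag2id = {t: i for i, t in enumerate(tags)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(MODEL_NAME, num_labels=len(tags))

# 2. Collate pre-tokenized words and align UPOS labels to subword pieces.
def encode(batch):
    words, upos = zip(*batch)
    enc = tokenizer(list(words), is_split_into_words=True, truncation=True,
                    padding=True, return_tensors="pt")
    labels = torch.full(enc["input_ids"].shape, -100)       # -100 = ignored by the loss
    for i in range(len(batch)):
        prev = None
        for j, wid in enumerate(enc.word_ids(batch_index=i)):
            if wid is not None and wid != prev:              # label first subword only
                labels[i, j] = tag2id[upos[i][wid]]
            prev = wid
    enc["labels"] = labels
    return enc

loader = DataLoader(sentences, batch_size=16, shuffle=True, collate_fn=encode)
optim = torch.optim.AdamW(model.parameters(), lr=3e-5)

# 3. Plain fine-tuning loop.
model.train()
for epoch in range(3):
    for batch in loader:
        optim.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optim.step()
```

The same token-classification scaffolding extends to XPOS tags or dependency-relation labels; full dependency parsing would additionally require a head-selection component such as a biaffine parser.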
Related papers
- Kanana: Compute-efficient Bilingual Language Models [9.597618914676106]
Kanana is a series of bilingual language models that demonstrate strong performance in Korean and competitive performance in English. The report details the techniques employed during pre-training to achieve compute-efficient yet competitive models. It also elaborates on approaches used to adapt the language models to specific scenarios, such as embedding, retrieval-augmented generation, and function calling.
arXiv Detail & Related papers (2025-02-26T08:36:20Z) - LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation (IBUT) method based on the cross-lingual capabilities of large language models (LLMs). The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately. The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z) - Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure [15.564927804136852]
SPUD (Semantically Perturbed Universal Dependencies) is a framework for creating nonce treebanks for the Universal Dependencies (UD) corpora.
We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks.
arXiv Detail & Related papers (2023-11-13T17:36:58Z) - Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders, fine-tuned with a supervised contrastive loss on NLI data.
Our results show that NLI fine-tuning improves performance on both tasks and in both languages, with the potential to improve both monolingual and multilingual models.
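As a minimal sketch of the general recipe this entry describes (contrastive fine-tuning of a sentence encoder on NLI-style pairs), assuming the sentence-transformers library, a placeholder multilingual checkpoint, and toy data; this is not the paper's exact configuration:

```python
# Hedged sketch: contrastive fine-tuning of a sentence encoder on NLI-style
# entailment pairs. Model name and training pairs are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

# Entailment pairs act as positives; other in-batch examples act as negatives.
train_examples = [
    InputExample(texts=["A man is playing a guitar.", "Someone is making music."]),
    InputExample(texts=["Two dogs run on the beach.", "Animals are outside."]),
]
train_loader = DataLoader(train_examples, batch_size=2, shuffle=True)
train_loss = losses.MultipleNegativesRankingLoss(model)   # contrastive objective

model.fit(train_objectives=[(train_loader, train_loss)], epochs=1, warmup_steps=10)
```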
arXiv Detail & Related papers (2023-08-06T12:40:58Z) - Data Augmentation for Machine Translation via Dependency Subtree Swapping [0.0]
We present a generic framework for data augmentation via dependency subtree swapping.
We extract corresponding subtrees from the dependency parse trees of the source and target sentences and swap these across bisentences to create augmented samples.
We conduct resource-constrained experiments on 4 language pairs in both directions using the IWSLT text translation datasets and the Hunglish2 corpus.
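A hedged sketch of the core swap operation described above, assuming simple (form, head, deprel) token triples with 1-based heads and projective (contiguous) subtrees; the paper's framework operates on aligned source-target bisentences, which this toy version does not model:

```python
# Rough sketch (assumption, not the authors' code): swap the token span of a
# subtree bearing a given dependency relation between two parsed sentences.
def subtree_indices(tokens, root):
    """1-based indices of `root` and all of its descendants."""
    kids = {}
    for i, (_, head, _) in enumerate(tokens, start=1):
        kids.setdefault(head, []).append(i)
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(kids.get(node, []))
    return sorted(out)

def swap_subtrees(sent_a, sent_b, deprel):
    """Exchange the first `deprel` subtree between two sentences (word level).
    Assumes both sentences contain the relation."""
    ra = next(i for i, t in enumerate(sent_a, 1) if t[2] == deprel)
    rb = next(i for i, t in enumerate(sent_b, 1) if t[2] == deprel)
    ia, ib = subtree_indices(sent_a, ra), subtree_indices(sent_b, rb)
    words_a, words_b = [t[0] for t in sent_a], [t[0] for t in sent_b]
    chunk_a = [words_a[i - 1] for i in ia]
    chunk_b = [words_b[i - 1] for i in ib]
    new_a = words_a[:ia[0] - 1] + chunk_b + words_a[ia[-1]:]
    new_b = words_b[:ib[0] - 1] + chunk_a + words_b[ib[-1]:]
    return new_a, new_b

# Toy usage: exchange the "obl" phrases of two parsed sentences.
sent_a = [("She", 2, "nsubj"), ("slept", 0, "root"),
          ("at", 4, "case"), ("home", 2, "obl")]
sent_b = [("He", 2, "nsubj"), ("ran", 0, "root"),
          ("in", 5, "case"), ("the", 5, "det"), ("park", 2, "obl")]
print(swap_subtrees(sent_a, sent_b, "obl"))
# (['She', 'slept', 'in', 'the', 'park'], ['He', 'ran', 'at', 'home'])
```

In the augmentation setting, the same operation would be applied in parallel to the source and target sides of two bisentences that share the chosen dependency relation.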
arXiv Detail & Related papers (2023-07-13T19:00:26Z) - Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
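For context, a minimal sketch of the GCN-over-dependency-parse baseline that this entry refers to (not the proposed attention-based GATE model); the layer, shapes, and toy adjacency matrix are illustrative assumptions:

```python
# Hedged sketch: one plain GCN layer over a dependency-parse adjacency matrix.
import torch
import torch.nn as nn

class DependencyGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x:   (batch, seq_len, dim) token representations
        # adj: (batch, seq_len, seq_len) dependency adjacency with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1)   # simple degree normalization
        h = torch.bmm(adj / deg, x)                        # aggregate syntactic neighbors
        return torch.relu(self.linear(h))

# Toy usage: one sentence of 4 tokens, edges from the dependency tree + self-loops.
x = torch.randn(1, 4, 16)
adj = torch.eye(4).unsqueeze(0)
adj[0, 0, 1] = adj[0, 1, 0] = 1.0   # e.g. token 0 <-> token 1 are head/dependent
print(DependencyGCNLayer(16)(x, adj).shape)   # torch.Size([1, 4, 16])
```

Because each layer only mixes information along direct dependency edges, words separated by long paths in the tree need many layers to interact, which is the limitation the entry highlights.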
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Examination and Extension of Strategies for Improving Personalized Language Modeling via Interpolation [59.35932511895986]
We demonstrate improvements in offline metrics at the user level by interpolating a global LSTM-based authoring model with a user-personalized n-gram model.
We observe that over 80% of users see a perplexity improvement, with an average improvement of 5.2% per user.
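A minimal sketch of the interpolation step itself, with made-up distributions and an assumed mixing weight; the paper's models (a global LSTM and per-user n-grams) are simply represented here as next-word probability tables:

```python
# Linear interpolation of next-word distributions:
#   P(w) = lam * P_user(w) + (1 - lam) * P_global(w)
def interpolate(p_global, p_user, lam=0.3):
    vocab = set(p_global) | set(p_user)
    return {w: lam * p_user.get(w, 0.0) + (1 - lam) * p_global.get(w, 0.0)
            for w in vocab}

p_global = {"meeting": 0.4, "lunch": 0.6}   # global LSTM's next-word distribution (toy)
p_user   = {"meeting": 0.9, "lunch": 0.1}   # per-user n-gram distribution (toy)
print(interpolate(p_global, p_user))        # user history shifts mass toward "meeting"
```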
arXiv Detail & Related papers (2020-06-09T19:29:41Z) - Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean [15.899449418195106]
We first raise important issues regarding the Penn Korean Universal Dependency Treebank (PKT-UD).
We address these issues by revising the entire corpus manually with the aim of producing cleaner UD annotations.
For compatibility with the rest of the UD corpora, we extensively revise the part-of-speech tags and the dependency relations.
arXiv Detail & Related papers (2020-05-26T17:46:46Z) - Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships may help self-attention networks (SANs) handle such divergences.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
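As a generic, assumption-level illustration of position-aware self-attention (a single learned bias per position pair, which is much simpler than the paper's bilingually aware position representation):

```python
# Hedged sketch: scaled dot-product self-attention with an added learned
# position-bias term. Dimensions and the toy input are assumptions.
import math
import torch
import torch.nn as nn

class PositionBiasedSelfAttention(nn.Module):
    def __init__(self, dim, max_len=128):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # One learned bias per (query position, key position) pair.
        self.pos_bias = nn.Parameter(torch.zeros(max_len, max_len))
        self.scale = math.sqrt(dim)

    def forward(self, x):
        # x: (batch, seq_len, dim)
        n = x.size(1)
        scores = self.q(x) @ self.k(x).transpose(1, 2) / self.scale
        scores = scores + self.pos_bias[:n, :n]            # inject position information
        return torch.softmax(scores, dim=-1) @ self.v(x)

x = torch.randn(2, 10, 32)
print(PositionBiasedSelfAttention(32)(x).shape)            # torch.Size([2, 10, 32])
```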
arXiv Detail & Related papers (2020-04-28T05:23:43Z)