UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
- URL: http://arxiv.org/abs/2506.09009v2
- Date: Wed, 11 Jun 2025 05:35:33 GMT
- Title: UD-KSL Treebank v1.3: A semi-automated framework for aligning XPOS-extracted units with UPOS tags
- Authors: Hakyung Sung, Gyu-Ho Shin, Chanyoung Lee, You Kyung Sung, Boo Kyung Jung
- Abstract summary: We introduce a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The present study extends recent work on Universal Dependencies annotations for second-language (L2) Korean by introducing a semi-automated framework that identifies morphosyntactic constructions from XPOS sequences and aligns those constructions with corresponding UPOS categories. We also broaden the existing L2-Korean corpus by annotating 2,998 new sentences from argumentative essays. To evaluate the impact of XPOS-UPOS alignments, we fine-tune L2-Korean morphosyntactic analysis models on datasets both with and without these alignments, using two NLP toolkits. Our results indicate that the aligned dataset not only improves consistency across annotation layers but also enhances morphosyntactic tagging and dependency-parsing accuracy, particularly in cases of limited annotated data.
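The abstract does not ship code, but the alignment step it describes can be pictured concretely. Below is a minimal sketch, assuming a hand-written rule table keyed on XPOS n-grams; the Sejong-style tag sequences and their UPOS targets are invented examples, not the paper's actual rules:

```python
# Minimal sketch: scan the XPOS column of CoNLL-U rows for known
# morphosyntactic constructions and overwrite the UPOS column to match.
# The rule table is illustrative only, not the paper's.
RULES = {
    ("NNG", "XSV"): ("NOUN", "VERB"),   # noun + verbalizing suffix (invented rule)
    ("VV", "EC"): ("VERB", "SCONJ"),    # verb stem + connective ending (invented rule)
}

def align_sentence(rows):
    """rows: one sentence as a list of 10-column CoNLL-U field lists."""
    xpos = [r[4] for r in rows]                     # column 5 is XPOS
    for n in sorted({len(k) for k in RULES}, reverse=True):
        for i in range(len(xpos) - n + 1):
            hit = RULES.get(tuple(xpos[i:i + n]))
            if hit:
                for j, upos in enumerate(hit):
                    rows[i + j][3] = upos           # column 4 is UPOS
    return rows
```

A rule pass of this kind is what makes the annotation layers mutually consistent: wherever the XPOS sequence identifies a construction, the UPOS layer is forced to agree.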
Related papers
- Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems [0.0]
We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations.
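As a hypothetical illustration of what integrating such annotations can mean in practice, the feedback spans can be spliced into the essay text as inline tags before it reaches the scoring model; the tag names and span format here are assumptions, not the paper's:

```python
def annotate(essay: str, spans):
    """Wrap annotated character spans in inline tags.

    spans: list of (start, end, label) tuples, e.g. (3, 5, "GRAMMAR").
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(essay[cursor:start])
        out.append(f"<{label}>{essay[start:end]}</{label}>")
        cursor = end
    out.append(essay[cursor:])
    return "".join(out)

print(annotate("He go to school.", [(3, 5, "GRAMMAR")]))
# -> He <GRAMMAR>go</GRAMMAR> to school.
```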
arXiv Detail & Related papers (2025-05-28T18:39:56Z) - Split Matching for Inductive Zero-shot Semantic Segmentation [52.90218623214213]
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. We propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and it achieves state-of-the-art performance on two standard benchmarks.
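For intuition only, decoupling an assignment problem can be pictured with scipy's linear_sum_assignment run separately over two disjoint candidate groups; the random cost matrix and the seen/unseen split below are stand-ins, not the paper's construction:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
cost = rng.random((6, 6))             # queries x candidate classes (stand-in)
seen, unseen = [0, 1, 2], [3, 4, 5]   # assumed split of class columns

# One Hungarian matching per group instead of a single global one.
for group in (seen, unseen):
    rows, cols = linear_sum_assignment(cost[:, group])
    print([(int(r), group[int(c)]) for r, c in zip(rows, cols)])
```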
arXiv Detail & Related papers (2025-05-08T07:56:30Z) - Second language Korean Universal Dependency treebank v1.2: Focus on data augmentation and annotation scheme refinement [0.0]
We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. We fine-tune three Korean language models and evaluate their performance on in-domain and out-of-domain L2-Korean datasets.
arXiv Detail & Related papers (2025-03-18T20:42:42Z) - KPC-cF: Aspect-Based Sentiment Analysis via Implicit-Feature Alignment with Corpus Filtering [0.0]
Our research proposes an intuitive and effective framework for ABSA in low-resource languages such as Korean. It optimizes prediction labels by integrating a translated benchmark with unlabeled Korean data. Compared to English ABSA, our framework showed an approximately 3% difference in F1 scores and accuracy.
arXiv Detail & Related papers (2024-06-29T07:01:51Z) - Dual-Alignment Pre-training for Cross-lingual Sentence Embedding [79.98111074307657]
We propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding.
We introduce a novel representation translation learning (RTL) task, where the model learns to use one-side contextualized token representation to reconstruct its translation counterpart.
Our approach significantly improves sentence embeddings.
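A toy rendering of the RTL idea described above, with random tensors standing in for encoder outputs; the mean-pooling and linear decoder are assumptions made to keep the sketch self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, vocab, src_len, tgt_len = 256, 1000, 12, 10
src_repr = torch.randn(1, src_len, hidden)        # stand-in contextual encoder output
tgt_ids = torch.randint(0, vocab, (1, tgt_len))   # stand-in translation token ids

# Use only the source side's representations to predict the target tokens.
pooled = src_repr.mean(dim=1, keepdim=True).expand(-1, tgt_len, -1)
decoder = nn.Linear(hidden, vocab)
loss = F.cross_entropy(decoder(pooled).transpose(1, 2), tgt_ids)
print(loss.item())
```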
arXiv Detail & Related papers (2023-05-16T03:53:30Z) - VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose VECO 2.0, a cross-lingual pre-trained model based on contrastive learning with multi-granularity alignments.
Specifically, sequence-to-sequence alignment is induced to maximize the similarity of parallel pairs and minimize that of non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens excavated via a thesaurus dictionary and the other unpaired tokens in a bilingual instance.
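In its generic form, a sequence-level alignment like this is an in-batch contrastive objective over parallel sentence embeddings (diagonal pairs positive, off-diagonal pairs negative); the sketch below uses random vectors and an assumed temperature:

```python
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 128, 0.05
src = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in source embeddings
tgt = F.normalize(torch.randn(batch, dim), dim=-1)  # stand-in target embeddings

logits = src @ tgt.T / temperature   # pairwise similarities
labels = torch.arange(batch)         # the positive for row i is column i
loss = F.cross_entropy(logits, labels)
print(loss.item())
```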
arXiv Detail & Related papers (2023-04-17T12:23:41Z) - Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun), a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
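The funnelling scheme itself is easy to picture: first-tier classifiers, one per language, emit posterior probabilities, and a single meta-classifier is trained on those posteriors so every language shares one decision space. A minimal sketch on synthetic data (gFun additionally mixes in heterogeneous document embeddings, omitted here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
meta_X, meta_y = [], []
for lang in ("en", "ko"):                       # one first-tier model per language
    X = rng.random((100, 20))
    y = rng.integers(0, 2, 100)
    first_tier = LogisticRegression().fit(X, y)
    meta_X.append(first_tier.predict_proba(X))  # posteriors become shared features
    meta_y.append(y)

meta = LogisticRegression().fit(np.vstack(meta_X), np.concatenate(meta_y))
```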
arXiv Detail & Related papers (2021-09-17T23:33:04Z) - A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space [61.18554842370824]
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
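The retrieval measure reduces to a few lines: embed both sides, take each source sentence's nearest target by cosine similarity, and score the fraction retrieved correctly; random vectors stand in for cross-lingual encoder outputs here:

```python
import numpy as np

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 64))
tgt = src + 0.1 * rng.normal(size=(50, 64))   # noisy "translations" (stand-in)

src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
accuracy = float(np.mean((src @ tgt.T).argmax(axis=1) == np.arange(50)))
print(accuracy)   # bitext retrieval accuracy as the alignment score
```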
arXiv Detail & Related papers (2021-09-13T21:05:37Z) - X2Parser: Cross-Lingual and Cross-Domain Framework for Task-Oriented Compositional Semantic Parsing [51.81533991497547]
Task-oriented compositional semantic parsing (TCSP) handles complex nested user queries.
We present X2Parser, a transferable Cross-lingual and Cross-domain Parser for TCSP.
We propose to predict flattened intent and slot representations separately and cast both prediction tasks into sequence-labeling problems.
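What the flattened, two-task casting can look like for a single query (the query, labels, and BIO scheme below are invented for illustration; X2Parser's actual label inventory differs):

```python
tokens  = ["set", "an", "alarm", "for", "9", "am"]
slots   = ["O", "O", "O", "O", "B-TIME", "I-TIME"]   # slot sequence-labeling task
intents = ["B-SET_ALARM"] + ["I-SET_ALARM"] * 5      # intent sequence-labeling task

for tok, slot, intent in zip(tokens, slots, intents):
    print(f"{tok:<6} {slot:<8} {intent}")
```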
arXiv Detail & Related papers (2021-06-07T16:40:05Z) - Incorporating Visual Layout Structures for Scientific Text Classification [31.15058113053433]
We introduce new methods for incorporating VIsual LAyout structures (VILA), e.g., the grouping of page texts into text lines or text blocks, into language models.
We show that the I-VILA approach, which simply adds special tokens denoting boundaries between layout structures into model inputs, can lead to +14.5 F1 Score improvements in token classification tasks.
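The I-VILA mechanism in miniature: insert a marker token at every layout-group boundary before tokenization, so the model sees where lines or blocks break. The [BLK] string below is an assumed marker, not necessarily the paper's:

```python
def insert_boundaries(blocks, marker="[BLK]"):
    """blocks: list of text-block strings from a page-layout parser."""
    return f" {marker} ".join(blocks)

page = ["1 Introduction", "Language models ...", "2 Method"]
print(insert_boundaries(page))
# -> 1 Introduction [BLK] Language models ... [BLK] 2 Method
```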
arXiv Detail & Related papers (2021-06-01T17:59:00Z)