Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
- URL: http://arxiv.org/abs/2412.13146v1
- Date: Tue, 17 Dec 2024 18:12:33 GMT
- Title: Syntactic Transfer to Kyrgyz Using the Treebank Translation Method
- Authors: Anton Alekseev, Alina Tillabaeva, Gulnara Dzh. Kabaeva, Sergey I. Nikolenko
- Abstract summary: This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz.
We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method.
- Score: 5.011924788933374
- License:
- Abstract: The Kyrgyz language, as a low-resource language, requires significant effort to create high-quality syntactic corpora. This study proposes an approach to simplify the development process of a syntactic corpus for Kyrgyz. We present a tool for transferring syntactic annotations from Turkish to Kyrgyz based on a treebank translation method. The effectiveness of the proposed tool was evaluated using the TueCL treebank. The results demonstrate that this approach achieves higher syntactic annotation accuracy compared to a monolingual model trained on the Kyrgyz KTMU treebank. Additionally, the study introduces a method for assessing the complexity of manual annotation for the resulting syntactic trees, contributing to further optimization of the annotation process.
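The core idea behind this kind of transfer can be illustrated with a minimal sketch (this is not the authors' released tool; the conllu library, the toy Turkish-Kyrgyz lexicon, and the example sentence are illustrative assumptions): translate each source token while keeping the Turkish dependency heads and relation labels fixed, so the projected tree can serve as a first-pass Kyrgyz annotation.

```python
# Minimal sketch of annotation projection via treebank translation
# (illustrative only, not the authors' tool): word forms are replaced
# using a toy Turkish->Kyrgyz lexicon while dependency heads and
# relation labels are left unchanged from the Turkish tree.
# A real pipeline must also handle reordering, token fusion/splitting,
# morphological features, and translation ambiguity.
import conllu

# Hypothetical toy lexicon; a real system would use an MT model or a large dictionary.
TR_KY_LEXICON = {"o": "ал", "kitap": "китеп", "okudu": "окуду"}

TURKISH_CONLLU = """\
# text = O kitap okudu
1\tO\to\tPRON\t_\t_\t3\tnsubj\t_\t_
2\tkitap\tkitap\tNOUN\t_\t_\t3\tobj\t_\t_
3\tokudu\toku\tVERB\t_\t_\t0\troot\t_\t_

"""

def project_sentence(sentence):
    """Translate surface forms; keep heads and deprels from the source tree."""
    for token in sentence:
        token["form"] = TR_KY_LEXICON.get(token["form"].lower(), token["form"])
        token["lemma"] = "_"  # lemmas would require a Kyrgyz morphological analyzer
    return sentence

if __name__ == "__main__":
    for sentence in conllu.parse(TURKISH_CONLLU):
        print(project_sentence(sentence).serialize())
```

Running this prints a CoNLL-U sentence whose forms are Kyrgyz but whose tree structure is still the Turkish one; such projected trees would still need manual review, which is the step the paper's annotation-complexity estimate is meant to help prioritize.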
Related papers
- Optimizing Large Language Models for Turkish: New Methodologies in Corpus Selection and Training [0.0]
We adapt datasets generated by Large Language Models and datasets translated from English into Turkish.
This approach led to substantial enhancements in model accuracy for both few-shot and zero-shot learning scenarios.
arXiv Detail & Related papers (2024-12-03T19:17:18Z) - BoolQuestions: Does Dense Retrieval Understand Boolean Logic in Language? [88.29075896295357]
We first investigate whether current retrieval systems can comprehend the Boolean logic implied in language.
Through extensive experimental results, we draw the conclusion that current dense retrieval systems do not fully understand Boolean logic in language.
We propose a contrastive continual training method that serves as a strong baseline for the research community.
arXiv Detail & Related papers (2024-11-19T05:19:53Z) - Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z) - Unifying Structure and Language Semantic for Efficient Contrastive
Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - HanoiT: Enhancing Context-aware Translation via Selective Context [95.93730812799798]
Context-aware neural machine translation aims to use the document-level context to improve translation quality.
However, irrelevant or trivial words may introduce noise and distract the model from learning the relationship between the current sentence and the auxiliary context.
We propose a novel end-to-end encoder-decoder model with a layer-wise selection mechanism to sift and refine the long document context.
arXiv Detail & Related papers (2023-01-17T12:07:13Z) - LyS_ACoruña at SemEval-2022 Task 10: Repurposing Off-the-Shelf Tools
for Sentiment Analysis as Semantic Dependency Parsing [10.355938901584567]
This paper addresses the problem of structured sentiment analysis using a bi-affine semantic dependency parser.
For the monolingual setup, we considered: (i) training on a single treebank, and (ii) relaxing the setup by training on treebanks coming from different languages.
For the zero-shot setup and a given target treebank, we relied on: (i) a word-level translation of available treebanks in other languages to get noisy, likely ungrammatical, but annotated data.
In the post-evaluation phase, we also trained cross-lingual models that simply merged all the English treebanks.
arXiv Detail & Related papers (2022-04-27T10:21:28Z) - Syntactic Structure Distillation Pretraining For Bidirectional Encoders [49.483357228441434]
We introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining.
We distill the approximate marginal distribution over words in context from a syntactic language model (LM).
Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data.
arXiv Detail & Related papers (2020-05-27T16:44:01Z) - Exploiting Syntactic Structure for Better Language Modeling: A Syntactic
Distance Approach [78.77265671634454]
We make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground-truth parse trees in a form called "syntactic distances" (a schematic form of this objective is sketched after this list).
Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground-truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.
arXiv Detail & Related papers (2020-05-12T15:35:00Z) - Reference Language based Unsupervised Neural Machine Translation [108.64894168968067]
Unsupervised neural machine translation (UNMT) almost completely relieves the parallel corpus curse.
We propose a new reference language-based framework for UNMT, RUNMT, in which the reference language only shares a parallel corpus with the source.
Experimental results show that our methods improve the quality of UNMT over that of a strong baseline that uses only one auxiliary language.
arXiv Detail & Related papers (2020-04-05T08:28:08Z)
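As a schematic aside for the syntactic-distance entry above, the multi-task objective mentioned there can be written in a generic form; the weighting and the exact parameterization of the distances are assumptions for illustration, not taken from the cited paper:

```latex
% Generic multi-task objective (illustrative sketch): a language-modeling
% loss plus a weighted squared-error loss on per-position syntactic
% distances d_t between adjacent words w_t and w_{t+1}.
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{LM}}
  \;+\; \lambda \sum_{t=1}^{T-1} \bigl(\hat{d}_t - d_t\bigr)^{2}
```

Here \(\hat{d}_t\) is the model's predicted distance, \(d_t\) the distance derived from the ground-truth parse tree, and \(\lambda\) trades off the two tasks; minimizing the second term encourages the language model's representations to reflect the gold bracketing.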