Thai Universal Dependency Treebank
- URL: http://arxiv.org/abs/2405.07586v1
- Date: Mon, 13 May 2024 09:48:13 GMT
- Title: Thai Universal Dependency Treebank
- Authors: Panyut Sriwirote, Wei Qi Leong, Charin Polpanumas, Santhawat Thanyawong, William Chandra Tjhi, Wirote Aroonmanakun, Attapol T. Rutherford
- Abstract summary: We introduce the Thai Universal Dependency Treebank (TUD), the largest Thai treebank to date, consisting of 3,627 trees annotated in accordance with the Universal Dependencies (UD) framework.
We then benchmark dependency parsing models that incorporate pretrained transformers as encoders and train them on Thai-PUD and our TUD.
The results show that most of our models outperform models reported in previous papers and provide insight into the optimal choices of components for Thai dependency parsers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic dependency parsing of Thai sentences has been underexplored, as evidenced by the lack of large Thai dependency treebanks with complete dependency structures and the lack of a published systematic evaluation of state-of-the-art models, especially transformer-based parsers. In this work, we address these problems by introducing the Thai Universal Dependency Treebank (TUD), the largest Thai treebank to date, consisting of 3,627 trees annotated in accordance with the Universal Dependencies (UD) framework. We then benchmark dependency parsing models that incorporate pretrained transformers as encoders and train them on Thai-PUD and our TUD. The evaluation results show that most of our models can outperform other models reported in previous papers and provide insight into the optimal choices of components to include in Thai dependency parsers. The new treebank and every model's full prediction generated in our experiment are made available on a GitHub repository for further study.
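For context, UD treebanks such as TUD are distributed in CoNLL-U format, and dependency parsers are conventionally scored with unlabeled and labeled attachment scores (UAS/LAS). Below is a minimal sketch of such an evaluation over gold and predicted files; the file names are hypothetical and the snippet assumes plain, aligned CoNLL-U input.

```python
# Minimal UAS/LAS computation over aligned CoNLL-U files.
# File names are hypothetical; 0-indexed columns 6 and 7 of a
# CoNLL-U token line hold HEAD and DEPREL.
def read_conllu(path):
    """Yield each sentence as a list of (head, deprel) pairs."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:                    # blank line ends a sentence
                if sent:
                    yield sent
                sent = []
            elif not line.startswith("#"):  # skip sentence-level comments
                cols = line.split("\t")
                if cols[0].isdigit():       # skip multiword-token and empty-node rows
                    sent.append((cols[6], cols[7]))
        if sent:
            yield sent

def attachment_scores(gold_path, pred_path):
    total = uas = las = 0
    for gold, pred in zip(read_conllu(gold_path), read_conllu(pred_path)):
        for (gh, gd), (ph, pd) in zip(gold, pred):
            total += 1
            uas += gh == ph
            las += gh == ph and gd == pd
    return uas / total, las / total

print(attachment_scores("tud-test-gold.conllu", "tud-test-pred.conllu"))
```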
Related papers
- Modern Neighborhood Components Analysis: A Deep Tabular Baseline Two Decades Later [59.88557193062348]
We revisit the classic Neighborhood Components Analysis (NCA), designed to learn a linear projection that captures semantic similarities between instances.
We find that minor modifications, such as adjustments to the learning objectives and the integration of deep learning architectures, significantly enhance NCA's performance.
We also introduce a neighbor sampling strategy that improves both the efficiency and predictive accuracy of our proposed ModernNCA.
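For reference, the classic NCA objective maximizes the expected leave-one-out accuracy of a soft nearest-neighbor classifier under a learned linear projection; the deep-architecture and neighbor-sampling changes described above build on this. A minimal PyTorch sketch of the classic objective:

```python
import torch

def nca_loss(x, y, A):
    """Classic NCA: x is (n, d) inputs, y is (n,) integer labels, A is a (d, k) projection."""
    z = x @ A                                     # project all instances
    dist2 = torch.cdist(z, z).pow(2)              # pairwise squared distances
    eye = torch.eye(len(y), dtype=torch.bool)
    dist2 = dist2.masked_fill(eye, float("inf"))  # a point is never its own neighbor
    p = torch.softmax(-dist2, dim=1)              # soft-neighbor probabilities p_ij
    same = (y.unsqueeze(0) == y.unsqueeze(1)).float()
    return -(p * same).sum()                      # negative expected leave-one-out accuracy

# Toy usage: learn a 10-to-2 projection on random data.
x, y = torch.randn(64, 10), torch.randint(0, 3, (64,))
A = torch.randn(10, 2, requires_grad=True)
opt = torch.optim.Adam([A], lr=0.01)
for _ in range(100):
    opt.zero_grad()
    nca_loss(x, y, A).backward()
    opt.step()
```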
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z) - Enhancements to the BOUN Treebank Reflecting the Agglutinative Nature of Turkish [0.6514569292630354]
We aim to address the BOUN Treebank's lack of representation for null morphemes, highly productive derivational processes, and syncretic morphemes in Turkish, without diverging from the Universal Dependencies framework.
New annotation conventions were introduced by splitting certain lemmas and employing the MISC (miscellaneous) column of the UD framework to denote derivation.
The representational capabilities of the re-annotated treebank were tested with an LSTM-based dependency parser, and an updated version of the BoAT annotation tool is introduced.
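For readers unfamiliar with the format: the MISC column is the tenth field of a CoNLL-U token line and holds pipe-separated Key=Value pairs, which is what lets derivation information ride along without changing the UD schema. A small sketch of reading it; the Derivation key and the example token are illustrative, not the treebank's actual conventions:

```python
# Parse the CoNLL-U MISC column (tenth field, pipe-separated Key=Value pairs).
# The "Derivation" key and the example row are illustrative only; the BOUN
# Treebank's actual key names and values may differ.
def parse_misc(misc_field):
    if misc_field == "_":                  # underscore means "no annotation"
        return {}
    entries = {}
    for item in misc_field.split("|"):
        key, _, value = item.partition("=")
        entries[key] = value
    return entries

row = "4\tgözlükçü\tgözlük\tNOUN\t_\t_\t2\tnmod\t_\tDerivation=CI|SpaceAfter=No"
print(parse_misc(row.split("\t")[9]))      # {'Derivation': 'CI', 'SpaceAfter': 'No'}
```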
arXiv Detail & Related papers (2022-07-24T17:56:27Z) - Unsupervised and Few-shot Parsing from Pretrained Language Models [56.33247845224995]
We propose an unsupervised constituent parsing model that calculates inside and outside association scores solely based on the self-attention weight matrix learned in a pretrained language model.
We extend the unsupervised models to few-shot parsing models that use a few annotated trees to learn better linear projection matrices for parsing.
Our few-shot parsing model FPIO trained with only 20 annotated trees outperforms a previous few-shot parsing method trained with 50 annotated trees.
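As a toy illustration of the general idea of parsing from attention, the sketch below scores a span by the attention mass its tokens keep inside the span and splits greedily top-down; this is not the paper's exact association score, just the flavor of the approach:

```python
import numpy as np

def inside_score(attn, i, j):
    """Mean attention mass that tokens in span [i, j) direct within the span.
    Illustrative only; the paper's association scores are defined differently."""
    return attn[i:j, i:j].sum() / (j - i)

def split_span(attn, i, j):
    """Greedy top-down binary bracketing by attention cohesion."""
    if j - i <= 1:
        return i
    k = max(range(i + 1, j),
            key=lambda m: inside_score(attn, i, m) + inside_score(attn, m, j))
    return (split_span(attn, i, k), split_span(attn, k, j))

# Stand-in for one self-attention head over a 6-token sentence (rows sum to 1).
attn = np.random.dirichlet(np.ones(6), size=6)
print(split_span(attn, 0, 6))
```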
arXiv Detail & Related papers (2022-06-10T10:29:15Z) - Out-of-Domain Evaluation of Finnish Dependency Parsing [0.8957681069740162]
In many real-world applications, the data to which a model is applied may differ substantially from the characteristics of the training data.
In this paper, we focus on Finnish out-of-domain parsing by introducing a novel UD Finnish-OOD out-of-domain treebank.
We present extensive out-of-domain evaluation utilizing the available section-level information from three different UD treebanks.
arXiv Detail & Related papers (2022-04-22T10:34:19Z) - AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z) - Linguistic dependencies and statistical dependence [76.89273585568084]
We use pretrained language models to estimate probabilities of words in context.
We find that maximum-CPMI trees correspond to linguistic dependencies more often than trees extracted from a non-contextual PMI estimate.
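The pipeline behind this result is: estimate a matrix of pairwise conditional PMI (CPMI) scores with the language model, then read off the maximum spanning tree of that matrix and compare it with gold dependencies. A sketch of the tree-extraction half, assuming the score matrix has already been computed and symmetrized:

```python
import numpy as np

def max_spanning_tree(scores):
    """Prim-style maximum spanning tree over a symmetric word-association matrix.
    scores[i, j] stands in for the CPMI estimate between words i and j."""
    n = scores.shape[0]
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = max(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: scores[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges

s = np.random.rand(5, 5)
s = (s + s.T) / 2        # symmetrize, one common choice when scores are directional
print(max_spanning_tree(s))
```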
arXiv Detail & Related papers (2021-04-18T02:43:37Z) - Constructing Taxonomies from Pretrained Language Models [52.53846972667636]
We present a method for constructing taxonomic trees (e.g., WordNet) using pretrained language models.
Our approach is composed of two modules, one that predicts parenthood relations and another that reconciles those predictions into trees.
We train our model on subtrees sampled from WordNet, and test on non-overlapping WordNet subtrees.
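One way to realize the parenthood-prediction module is to prompt a masked language model with a hypernymy template and read off the scores of candidate parents; the template, model, and candidate list below are illustrative assumptions, not the paper's exact setup. The reconciliation module would then assemble these pairwise scores into a tree.

```python
from transformers import pipeline

# Score candidate parents for a term with a masked-LM prompt.
# Template and model choice are illustrative, not the paper's exact setup;
# note that targets must be single tokens in the model's vocabulary.
fill = pipeline("fill-mask", model="bert-base-uncased")

def parenthood_scores(term, candidate_parents):
    prompt = f"A {term} is a type of [MASK]."
    results = fill(prompt, targets=candidate_parents)
    return {r["token_str"]: r["score"] for r in results}

print(parenthood_scores("dog", ["animal", "vehicle", "fruit"]))
```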
arXiv Detail & Related papers (2020-10-24T07:16:21Z) - Recursive Top-Down Production for Sentence Generation with Latent Trees [77.56794870399288]
We model the production property of context-free grammars for natural and synthetic languages.
We present a dynamic programming algorithm that marginalises over latent binary tree structures with $N$ leaves.
We also present experimental results on German-English translation on the Multi30k dataset.
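To see why a dynamic program is needed here: the number of binary trees over N leaves is the Catalan number C(N-1), which grows exponentially, while the split-point recurrence below visits each span size only once. Swapping the counting arithmetic for (log-space) span scores turns it into the marginalization described above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_binary_trees(n_leaves):
    """Count binary trees over n_leaves leaves: the Catalan number C(n_leaves - 1).
    Replacing + and * with (log-)sums over span scores turns this counting
    recurrence into marginalization over latent binary trees."""
    if n_leaves <= 1:
        return 1
    return sum(num_binary_trees(k) * num_binary_trees(n_leaves - k)
               for k in range(1, n_leaves))

print([num_binary_trees(n) for n in range(1, 8)])  # [1, 1, 2, 5, 14, 42, 132]
```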
arXiv Detail & Related papers (2020-10-09T17:47:16Z) - Analysis of the Penn Korean Universal Dependency Treebank (PKT-UD): Manual Revision to Build Robust Parsing Model in Korean [15.899449418195106]
We first raise important issues regarding the Penn Korean Universal Dependency Treebank (PKT-UD).
We address these issues by revising the entire corpus manually with the aim of producing cleaner UD annotations.
For compatibility with the rest of the UD corpora, we extensively revise the part-of-speech tags and the dependency relations.
arXiv Detail & Related papers (2020-05-26T17:46:46Z) - Universal Dependencies according to BERT: both more specific and more general [4.63257209402195]
This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions.
We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one.
Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT.
arXiv Detail & Related papers (2020-04-30T07:48:07Z)