Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank
and the BoAT Annotation Tool
- URL: http://arxiv.org/abs/2002.10416v2
- Date: Thu, 16 Sep 2021 09:53:43 GMT
- Title: Resources for Turkish Dependency Parsing: Introducing the BOUN Treebank
and the BoAT Annotation Tool
- Authors: Utku Türk (1), Furkan Atmaca (1), Şaziye Betül Özateş
(2), Gözde Berk (2), Seyyit Talha Bedir (1), Abdullatif Köksal (2),
Balkız Öztürk Başaran (1), Tunga Güngör (2) and Arzucan
Özgür (2) ((1) Department of Linguistics, Boğaziçi University, (2)
Department of Computer Engineering, Boğaziçi University)
- Abstract summary: We introduce the resources that we developed for Turkish dependency parsing, which include a novel manually annotated treebank (BOUN Treebank).
Decisions regarding the annotation of the BOUN Treebank were made in line with the Universal Dependencies (UD) framework.
We report the parsing results of a state-of-the-art dependency parser obtained over the BOUN Treebank as well as two other treebanks in Turkish.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce the resources that we developed for Turkish
dependency parsing, which include a novel manually annotated treebank (BOUN
Treebank), along with the guidelines we adopted, and a new annotation tool
(BoAT). The manual annotation process we employed was shaped and implemented by
a team of four linguists and five Natural Language Processing (NLP)
specialists. Decisions regarding the annotation of the BOUN Treebank were made
in line with the Universal Dependencies (UD) framework as well as our recent
efforts for unifying the Turkish UD treebanks through manual re-annotation. To
the best of our knowledge, BOUN Treebank is the largest Turkish treebank. It
contains a total of 9,761 sentences from various topics including biographical
texts, national newspapers, instructional texts, popular culture articles, and
essays. In addition, we report the parsing results of a state-of-the-art
dependency parser obtained over the BOUN Treebank as well as two other
treebanks in Turkish. Our results demonstrate that the unification of the
Turkish annotation scheme and the introduction of a more comprehensive treebank
lead to improved performance with regard to dependency parsing.
Related papers
- MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank [56.810282574817414]
We present the first multi-dialect Bavarian treebank (MaiBaam), manually annotated with part-of-speech and syntactic dependency information in Universal Dependencies (UD).
We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers' orthographies.
Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries.
arXiv Detail & Related papers (2024-03-15T13:33:10Z) - Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z) - Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting
an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z) - Enhancements to the BOUN Treebank Reflecting the Agglutinative Nature of
Turkish [0.6514569292630354]
We aim to resolve the issues of the lack of representation of null morphemes, highly productive derivational processes, and syncretic morphemes of Turkish in the BOUN Treebank without diverging from the Universal Dependencies framework.
New annotation conventions were introduced by splitting certain lemmas and employing the MISC (miscellaneous) tab in the UD framework to denote derivation.
Representational capabilities of the re-annotated treebank were tested on an LSTM-based dependency parser, and an updated version of the BoAT Tool is introduced.
arXiv Detail & Related papers (2022-07-24T17:56:27Z) - Building an Endangered Language Resource in the Classroom: Universal
Dependencies for Kakataibo [0.8938910048099864]
We launch a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru.
We first discuss the collaborative methodology implemented, which proved effective for creating a treebank in the context of a Computational Linguistics course for undergraduates.
arXiv Detail & Related papers (2022-06-21T12:58:56Z) - Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of the OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z) - Learning compositional structures for semantic graph parsing [81.41592892863979]
We show how AM dependency parsing can be trained directly on a neural latent-variable model.
Our model picks up on several linguistic phenomena on its own and achieves comparable accuracy to supervised training.
arXiv Detail & Related papers (2021-06-08T14:20:07Z) - Apurinã Universal Dependencies Treebank [0.4893345190925178]
This paper presents and discusses the first Universal Dependencies treebank for the Apurinã language.
The treebank contains 76 fully annotated sentences, applies 14 parts-of-speech, as well as seven augmented or new features.
arXiv Detail & Related papers (2021-06-07T07:42:00Z) - Treebanking User-Generated Content: a UD Based Overview of Guidelines,
Corpora and Unified Recommendations [58.50167394354305]
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media.
It proposes a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts.
arXiv Detail & Related papers (2020-11-03T23:34:42Z) - Strongly Incremental Constituency Parsing with Graph Neural Networks [70.16880251349093]
Parsing sentences into syntax trees can benefit downstream applications in NLP.
Transition-based parsers build trees by executing actions in a state transition system.
Existing transition-based parsers are predominantly based on the shift-reduce transition system.
arXiv Detail & Related papers (2020-10-27T19:19:38Z) - Recursive Top-Down Production for Sentence Generation with Latent Trees [77.56794870399288]
We model the production property of context-free grammars for natural and synthetic languages.
We present a dynamic programming algorithm that marginalises over latent binary tree structures with $N$ leaves.
We also present experimental results on German-English translation on the Multi30k dataset.
arXiv Detail & Related papers (2020-10-09T17:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.