The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
- URL: http://arxiv.org/abs/2505.20428v1
- Date: Mon, 26 May 2025 18:25:10 GMT
- Title: The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
- Authors: Angelina A. Aquino, Lester James V. Miranda, Elsie Marie T. Or,
- Abstract summary: This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss its implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
Related papers
- Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling [63.98194996746229]
Large language models (LLMs) are prone to hallucination and producing factually incorrect information. We propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search.
arXiv Detail & Related papers (2024-12-19T13:55:48Z)
- Dependency Annotation of Ottoman Turkish with Multilingual BERT [0.0]
This study introduces a pretrained large language model-based annotation methodology for the first dependency treebank in Ottoman Turkish.
The resulting treebank will facilitate automated analysis of Ottoman Turkish documents, unlocking the linguistic richness embedded in this historical heritage.
arXiv Detail & Related papers (2024-02-22T17:58:50Z)
- Multilingual Nonce Dependency Treebanks: Understanding how Language Models represent and process syntactic structure [15.564927804136852]
SPUD (Semantically Perturbed Universal Dependencies) is a framework for creating nonce treebanks for the Universal Dependencies (UD) corpora.
We create nonce data in Arabic, English, French, German, and Russian, and demonstrate two use cases of SPUD treebanks.
arXiv Detail & Related papers (2023-11-13T17:36:58Z)
- Developing a Named Entity Recognition Dataset for Tagalog [0.0]
This dataset contains 7.8k documents across three entity types.
The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81.
We released the data and processing code publicly to inspire future work on Tagalog NLP.
arXiv Detail & Related papers (2023-11-13T08:56:47Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- Uni-Parser: Unified Semantic Parser for Question Answering on Knowledge Base and Database [86.03294330305097]
We propose a unified semantic parser for question answering (QA) on both knowledge bases (KB) and databases (DB).
We introduce the primitive (relation and entity in KB, table name, column name and cell value in DB) as an essential element in our framework.
We leverage the generator to predict final logical forms by altering and composing top-ranked primitives with different operations.
arXiv Detail & Related papers (2022-11-09T19:33:27Z)
- Discourse Analysis via Questions and Answers: Parsing Dependency Structures of Questions Under Discussion [57.43781399856913]
This work adopts the linguistic framework of Questions Under Discussion (QUD) for discourse analysis.
We characterize relationships between sentences as free-form questions, in contrast to exhaustive fine-grained questions.
We develop the first-of-its-kind QUD parser that derives a dependency structure of questions over full documents.
arXiv Detail & Related papers (2022-10-12T03:53:12Z)
- Benchmarking zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text [0.0]
We investigate the use of auxiliary data sources for creating task-specific models in the absence of annotated Tagalog data.
We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text.
arXiv Detail & Related papers (2022-08-03T02:20:10Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of the OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Treebanking User-Generated Content: a UD Based Overview of Guidelines, Corpora and Unified Recommendations [58.50167394354305]
This article presents a discussion on the main linguistic phenomena which cause difficulties in the analysis of user-generated texts found on the web and in social media.
It proposes a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts.
arXiv Detail & Related papers (2020-11-03T23:34:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.