Developing a Named Entity Recognition Dataset for Tagalog
- URL: http://arxiv.org/abs/2311.07161v1
- Date: Mon, 13 Nov 2023 08:56:47 GMT
- Title: Developing a Named Entity Recognition Dataset for Tagalog
- Authors: Lester James V. Miranda
- Abstract summary: This dataset contains 7.8k documents across three entity types.
The inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81.
We released the data and processing code publicly to inspire future work on Tagalog NLP.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the development of a Named Entity Recognition (NER) dataset for
Tagalog. This corpus helps fill the resource gap present in Philippine
languages today, where NER resources are scarce. The texts were obtained from a
pretraining corpus containing news reports, and were labeled by native
speakers in an iterative fashion. The resulting dataset contains ~7.8k
documents across three entity types: Person, Organization, and Location. The
inter-annotator agreement, as measured by Cohen's $\kappa$, is 0.81. We also
conducted extensive empirical evaluation of state-of-the-art methods across
supervised and transfer learning settings. Finally, we released the data and
processing code publicly to inspire future work on Tagalog NLP.
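The agreement statistic quoted above, Cohen's $\kappa$, measures agreement between two annotators beyond what their label frequencies alone would produce: $\kappa = (p_o - p_e)/(1 - p_e)$. A minimal sketch over token-level NER labels follows; the annotations and label set here are hypothetical illustrations, not drawn from the dataset itself.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement rate and p_e is the agreement expected by chance
    from each annotator's marginal label frequencies.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: dot product of the two label distributions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token-level annotations of the same sentence.
ann1 = ["PER", "PER", "O", "ORG", "O", "LOC", "O", "O"]
ann2 = ["PER", "PER", "O", "ORG", "O", "O",   "O", "O"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.795
```

Here the annotators agree on 7 of 8 tokens (p_o = 0.875) but, because "O" dominates both label distributions, chance agreement is already 25/64, yielding a kappa well below the raw agreement rate.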
Related papers
- The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project [0.0]
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures.
arXiv Detail & Related papers (2025-05-26T18:25:10Z) - ANCHOLIK-NER: A Benchmark Dataset for Bangla Regional Named Entity Recognition [0.8025340896297104]
The dataset comprises 17,405 sentences, roughly 3,481 per region.
The data was collected from two publicly available datasets and through web scraping of various online newspapers and articles.
It can be utilized to enhance NER systems in Bangla dialects, improve regional language understanding, and support applications in machine translation, information retrieval, and conversational AI.
arXiv Detail & Related papers (2025-02-16T16:59:10Z) - WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages [62.1053122134059]
The paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages.
We have developed a systematic data processing framework tailored for low-resource languages.
arXiv Detail & Related papers (2025-01-24T14:06:29Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past NLP research on dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Naamapadam: A Large-Scale Named Entity Annotated Data for Indic
Languages [15.214673043019399]
The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories.
The training dataset has been automatically created from the Samanantar parallel corpus.
We release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set.
arXiv Detail & Related papers (2022-12-20T11:15:24Z) - NusaCrowd: Open Source Initiative for Indonesian NLP Resources [104.5381571820792]
NusaCrowd is a collaborative initiative to collect and unify existing resources for Indonesian languages.
Our work strives to advance natural language processing (NLP) research for languages that are under-represented despite being widely spoken.
arXiv Detail & Related papers (2022-12-19T17:28:22Z) - Benchmarking zero-shot and few-shot approaches for tokenization,
tagging, and dependency parsing of Tagalog text [0.0]
We investigate the use of auxiliary data sources for creating task-specific models in the absence of annotated Tagalog data.
We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text.
arXiv Detail & Related papers (2022-08-03T02:20:10Z) - Part-of-Speech Tagging of Odia Language Using Statistical and Deep
Learning-Based Approaches [0.0]
This work presents conditional random field (CRF) and deep learning-based approaches (CNN and Bi-LSTM) to develop an Odia part-of-speech tagger.
The Bi-LSTM model with character-sequence features and pre-trained word vectors was observed to achieve state-of-the-art results.
arXiv Detail & Related papers (2022-07-07T12:15:23Z) - HiNER: A Large Hindi Named Entity Recognition Dataset [29.300418937509317]
This paper releases a standard-abiding Hindi NER dataset containing 109,146 sentences and 2,220,856 tokens, annotated with 11 tags.
The statistics of tag-set in our dataset show a healthy per-tag distribution, especially for prominent classes like Person, Location and Organisation.
Our dataset helps achieve a weighted F1 score of 88.78 with all the tags and 92.22 when we collapse the tag-set, as discussed in the paper.
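The weighted F1 scores reported above average each class's F1 weighted by that class's support in the gold labels, so frequent classes like Person and Location dominate the score. A minimal sketch of the metric, using hypothetical tag sequences rather than HiNER data:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        # True positives, and how often the class was predicted at all.
        tp = sum(t == p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        prec = tp / n_pred if n_pred else 0.0
        rec = tp / n_true
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += f1 * n_true  # weight by gold-label support
    return total / len(y_true)

# Hypothetical gold and predicted tag sequences.
y_true = ["PER", "PER", "LOC", "ORG", "O", "LOC"]
y_pred = ["PER", "O",   "LOC", "ORG", "O", "PER"]
print(round(weighted_f1(y_true, y_pred), 3))  # → 0.667
```

Collapsing a fine-grained tag-set (as the paper does) simply maps several tags onto one before scoring, which removes confusions between the merged classes and typically raises the score.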
arXiv Detail & Related papers (2022-04-28T19:14:21Z) - Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few-shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Development of a Dataset and a Deep Learning Baseline Named Entity
Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi [0.983719084224035]
Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages.
This paper focuses on developing a NER benchmark dataset, created by annotating parts of the available corpora, to support machine translation systems that translate from these languages to Hindi.
arXiv Detail & Related papers (2020-09-14T14:07:50Z) - CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.