CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
- URL: http://arxiv.org/abs/2310.00679v1
- Date: Sun, 1 Oct 2023 14:09:42 GMT
- Title: CebuaNER: A New Baseline Cebuano Named Entity Recognition Model
- Authors: Ma. Beatrice Emanuela Pilar, Ellyza Mari Papas, Mary Loise
Buenaventura, Dane Dedoroy, Myron Darrel Montefalcon, Jay Rhald Padilla, Lany
Maceda, Mideth Abisado, Joseph Marvin Imperial
- Abstract summary: We introduce CebuaNER, a new baseline model for named entity recognition in the Cebuano language.
To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language.
Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags.
- Score: 1.5056924758531152
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite being one of the most linguistically diverse regions,
Southeast Asia has struggled to match the Global North's level of
computational linguistics and language processing research. Thus,
initiatives such as open-sourcing corpora and the development of baseline
models for basic language processing tasks are important stepping stones to
encourage the growth of research efforts in the field. To answer this call, we
introduce CebuaNER, a new baseline model for named entity recognition (NER) in
the Cebuano language. Cebuano is the second most-used native language in the
Philippines, with over 20 million speakers. To build the model, we collected
and annotated over 4,000 news articles, the largest of any work in the
language, retrieved from online local Cebuano platforms to train algorithms
such as Conditional Random Field and Bidirectional LSTM. Our findings show
promising results as a new baseline model, achieving over 70% performance on
precision, recall, and F1 across all entity tags, as well as potential efficacy
in a crosslingual setup with Tagalog.
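The abstract names Conditional Random Fields (CRF) among the trained algorithms. Below is a minimal sketch of a CRF-based NER tagger in that spirit, using the sklearn-crfsuite library; the feature template, the BIO tag set, and the toy Cebuano sentence are illustrative assumptions, not the paper's published configuration.

import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Hand-built word-shape features for the token at position i."""
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sentence
    return feats

# Toy training pair: a Cebuano sentence with BIO entity tags (invented here;
# the paper's corpus and exact tag inventory may differ).
sentence = ["Si", "Juan", "miadto", "sa", "Sugbo"]
tags = ["O", "B-PER", "O", "O", "B-LOC"]

X_train = [[token_features(sentence, i) for i in range(len(sentence))]]
y_train = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train)[0])  # e.g. ['O', 'B-PER', 'O', 'O', 'B-LOC']

A Bidirectional LSTM variant, the other algorithm the abstract mentions, would replace these hand-built features with learned embeddings over the same BIO-tagged sequences.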
Related papers
- BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment in Central Philippine Languages [8.64545246732563]
We introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines.
We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada languages.
We propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
arXiv Detail & Related papers (2023-10-17T21:05:20Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models [6.907247943327277]
Polyglot is a pioneering project aimed at enhancing the non-English language performance of multilingual language models.
We introduce the Polyglot Korean models, which focus specifically on Korean rather than being multilingual in nature.
arXiv Detail & Related papers (2023-06-04T04:04:04Z)
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve automatic readability assessment (ARA) in a low-resource setting.
We collect short stories written in three Philippine languages (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that including CrossNGO, a novel feature exploiting n-gram overlap between languages with high mutual intelligibility, significantly improves the performance of ARA models (see the sketch after this list).
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
- IndicSUPERB: A Speech Processing Universal Performance Benchmark for Indian languages [16.121708272597154]
We release the IndicSUPERB benchmark for speech recognition in 12 Indian languages.
We train and evaluate different self-supervised models alongside a commonly used baseline benchmark.
We show that language-specific fine-tuned models are more accurate than the baseline on most of the tasks.
arXiv Detail & Related papers (2022-08-24T20:14:52Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- A Baseline Readability Model for Cebuano [0.0]
We developed the first baseline readability model for the Cebuano language.
Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers.
arXiv Detail & Related papers (2022-03-31T17:49:11Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
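The Automatic Readability Assessment entry above mentions CrossNGO, a feature built on n-gram overlap between mutually intelligible languages. As a loose sketch of the underlying idea only (the paper's exact feature definition is not reproduced here, and the example sentences are invented), character-trigram overlap between two text samples can be computed like this:

def char_ngrams(text, n=3):
    """Set of character n-grams in a lowercased text sample."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(sample_a, sample_b, n=3):
    """Jaccard overlap of the two samples' character n-gram sets."""
    a, b = char_ngrams(sample_a, n), char_ngrams(sample_b, n)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Invented sentences in two mutually intelligible Philippine languages.
tagalog = "pumunta siya sa palengke kahapon"
cebuano = "miadto siya sa merkado gahapon"
print(f"trigram overlap: {ngram_overlap(tagalog, cebuano):.2f}")

Closely related languages share far more of these n-grams than unrelated ones, which is what makes such a signal useful for readability models in a low-resource setting.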