BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment
in Central Philippine Languages
- URL: http://arxiv.org/abs/2310.11584v1
- Date: Tue, 17 Oct 2023 21:05:20 GMT
- Title: BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment
in Central Philippine Languages
- Authors: Joseph Marvin Imperial, Ekaterina Kochmar
- Abstract summary: We introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines.
We compiled a corpus of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and Rinconada languages.
We propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
- Score: 8.64545246732563
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Current research on automatic readability assessment (ARA) has focused on
improving the performance of models in high-resource languages such as English.
In this work, we introduce and release BasahaCorpus as part of an initiative
aimed at expanding available corpora and baseline models for readability
assessment in lower resource languages in the Philippines. We compiled a corpus
of short fictional narratives written in Hiligaynon, Minasbate, Karay-a, and
Rinconada -- languages belonging to the Central Philippine family tree subgroup
-- to train ARA models using surface-level, syllable-pattern, and n-gram
overlap features. We also propose a new hierarchical cross-lingual modeling
approach that takes advantage of a language's placement in the family tree to
increase the amount of available training data. Our study yields encouraging
results that support previous work showcasing the efficacy of cross-lingual
models in low-resource settings, as well as similarities in highly informative
linguistic features for mutually intelligible languages.
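To make the feature families named in the abstract concrete, here is a minimal Python sketch. It is not the authors' released code: the vowel-cluster syllable heuristic, the character n-gram size, the function names, and the toy family-tree grouping are all assumptions made purely for illustration of surface-level, syllable-pattern, and n-gram-overlap signals, plus the idea of pooling training data from sister languages.

```python
from collections import Counter

VOWELS = set("aeiou")

def surface_features(text: str) -> dict:
    """Surface-level features: word/sentence counts and average lengths."""
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.split()
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
    }

def syllable_pattern_features(text: str) -> dict:
    """Rough syllable-pattern features using a vowel-cluster heuristic
    (Central Philippine languages are largely CV/CVC syllabified)."""
    words = text.lower().split()
    syllables = []
    for w in words:
        count, prev_vowel = 0, False
        for ch in w:
            is_vowel = ch in VOWELS
            if is_vowel and not prev_vowel:
                count += 1
            prev_vowel = is_vowel
        syllables.append(max(count, 1))
    return {
        "avg_syllables_per_word": sum(syllables) / max(len(syllables), 1),
        "polysyllable_ratio": sum(s >= 3 for s in syllables) / max(len(syllables), 1),
    }

def char_ngram_overlap(text_a: str, text_b: str, n: int = 3) -> float:
    """Character n-gram overlap between two texts, a CrossNGO-style signal
    for pairs of mutually intelligible languages."""
    def ngrams(t: str) -> Counter:
        t = "".join(t.lower().split())
        return Counter(t[i:i + n] for i in range(len(t) - n + 1))
    a, b = ngrams(text_a), ngrams(text_b)
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 0.0

# Toy illustration of the hierarchical idea: pool training data from
# languages sharing a subtree with the target. This grouping is
# hypothetical; consult the paper for the actual family tree.
FAMILY_TREE = {
    "bisayan": ["hiligaynon", "karay-a", "minasbate"],
    "bikol": ["rinconada"],
}

def pooled_training_languages(target: str) -> list[str]:
    """Return the target language plus its sisters in the same subtree."""
    for members in FAMILY_TREE.values():
        if target in members:
            return members
    return [target]
```

Under this reading, features extracted for a low-resource target (say, Rinconada) could be combined with instances from its sister languages before fitting a standard readability classifier; the exact pooling levels and classifier are as described in the paper, not in this sketch.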
Related papers
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z) - Compass: Large Multilingual Language Model for South-east Asia [0.0]
CompassLLM is a large multilingual model specifically tailored for Southeast Asian languages.
Our model exhibits superior performance on South-east Asian languages such as Indonesian.
arXiv Detail & Related papers (2024-04-14T11:48:33Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - CebuaNER: A New Baseline Cebuano Named Entity Recognition Model [1.5056924758531152]
We introduce CebuaNER, a new baseline model for named entity recognition in the Cebuano language.
To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language.
Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags.
arXiv Detail & Related papers (2023-10-01T14:09:42Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models.
arXiv Detail & Related papers (2023-05-22T20:42:53Z) - A Baseline Readability Model for Cebuano [0.0]
We developed the first baseline readability model for the Cebuano language.
Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers.
arXiv Detail & Related papers (2022-03-31T17:49:11Z) - Can Character-based Language Models Improve Downstream Task Performance
in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z) - Improving the Lexical Ability of Pretrained Language Models for
Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research has shown that pretraining performs poorly for low-resource and distant language pairs because the representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing the scarcity of target-language training data and annotators in low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)