A Baseline Readability Model for Cebuano
- URL: http://arxiv.org/abs/2203.17225v1
- Date: Thu, 31 Mar 2022 17:49:11 GMT
- Title: A Baseline Readability Model for Cebuano
- Authors: Lloyd Lois Antonie Reyes, Michael Antonio Ibañez, Ranz Sapinit,
Mohammed Hussien, Joseph Marvin Imperial
- Abstract summary: We developed the first baseline readability model for the Cebuano language.
Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we developed the first baseline readability model for the
Cebuano language. Cebuano is the second most-used native language in the
Philippines with about 27.5 million speakers. As the baseline, we extracted
traditional or surface-based features, syllable patterns based on Cebuano's
documented orthography, and neural embeddings from the multilingual BERT model.
Results show that the first two sets of handcrafted linguistic features
obtained the best performance when trained on an optimized Random Forest model,
with approximately 84% across all metrics. The feature sets and algorithm used
are also similar to those in previous work on readability assessment for the
Filipino language, showing the potential for cross-lingual application. To
encourage more work
for readability assessment in Philippine languages such as Cebuano, we
open-sourced both code and data.
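To make the described baseline concrete, the following is a minimal, illustrative Python sketch of how surface-based and syllable-pattern features could be extracted and passed to a Random Forest classifier using scikit-learn. The feature definitions, the consonant-vowel pattern inventory, and the function names are hypothetical placeholders for illustration and are not taken from the paper's released code.

```python
# Illustrative sketch only: hypothetical surface and syllable-pattern features
# for readability classification with a Random Forest (scikit-learn).
import re
from collections import Counter

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

VOWELS = set("aeiou")
# Hypothetical consonant-vowel patterns; the paper derives its inventory
# from Cebuano's documented orthography.
SYLLABLE_PATTERNS = ["V", "CV", "VC", "CVC"]


def surface_features(text):
    """Traditional/surface-based features: simple length statistics."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = max(len(words), 1), max(len(sentences), 1)
    return [n_words, n_sents,
            sum(len(w) for w in words) / n_words,  # average word length
            n_words / n_sents]                     # average sentence length


def syllable_pattern_features(text):
    """Approximate relative frequency of each consonant-vowel pattern."""
    counts = Counter({p: 0 for p in SYLLABLE_PATTERNS})
    for word in re.findall(r"[a-zA-Z]+", text.lower()):
        cv = "".join("V" if ch in VOWELS else "C" for ch in word)
        for pattern in SYLLABLE_PATTERNS:
            counts[pattern] += cv.count(pattern)
    total = max(sum(counts.values()), 1)
    return [counts[p] / total for p in SYLLABLE_PATTERNS]


def featurize(texts):
    return [surface_features(t) + syllable_pattern_features(t) for t in texts]


def train_readability_model(texts, grade_levels):
    """Train and evaluate a Random Forest on the handcrafted features."""
    X_train, X_test, y_train, y_test = train_test_split(
        featurize(texts), grade_levels, test_size=0.2, random_state=42)
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))
    return clf

# Usage, with the open-sourced corpus loaded into `texts` and `grade_levels`:
#     model = train_readability_model(texts, grade_levels)
```

In the paper, the Random Forest is additionally hyperparameter-tuned (hence "optimized"), and mBERT embeddings are evaluated as a third feature set; the sketch above only covers the two handcrafted feature groups behind the reported ~84% results.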
Related papers
- BasahaCorpus: An Expanded Linguistic Resource for Readability Assessment
in Central Philippine Languages [8.64545246732563]
We introduce and release BasahaCorpus as part of an initiative aimed at expanding available corpora and baseline models for readability assessment in lower resource languages in the Philippines.
We compiled a corpus of short fictional narratives written in the Hiligaynon, Minasbate, Karay-a, and Rinconada languages.
We propose a new hierarchical cross-lingual modeling approach that takes advantage of a language's placement in the family tree to increase the amount of available training data.
arXiv Detail & Related papers (2023-10-17T21:05:20Z)
- CebuaNER: A New Baseline Cebuano Named Entity Recognition Model [1.5056924758531152]
We introduce CebuaNER, a new baseline model for named entity recognition in the Cebuano language.
To build the model, we collected and annotated over 4,000 news articles, the largest of any work in the language.
Our findings show promising results as a new baseline model, achieving over 70% performance on precision, recall, and F1 across all entity tags.
arXiv Detail & Related papers (2023-10-01T14:09:42Z)
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages in the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models.
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
- Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Diverse Linguistic Features for Assessing Reading Difficulty of Educational Filipino Texts [0.0]
This paper describes the development of automatic machine learning-based readability assessment models for educational Filipino texts.
Results show that a Random Forest model obtained a high performance of 62.7% accuracy.
arXiv Detail & Related papers (2021-07-31T13:59:46Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation [127.81351683335143]
Cross-lingual pretraining requires models to align the lexical- and high-level representations of the two languages.
Previous research suggests that performance suffers when these representations are not sufficiently aligned.
In this paper, we enhance the bilingual masked language model pretraining with lexical-level information by using type-level cross-lingual subword embeddings.
arXiv Detail & Related papers (2021-03-18T21:17:58Z)
- Application of Lexical Features Towards Improvement of Filipino Readability Identification of Children's Literature [0.0]
We explore the use of lexical features towards improving readability identification of children's books written in Filipino.
Results show that combining lexical features (LEX), consisting of type-token ratio, lexical density, lexical variation, and foreign word count, with traditional features (TRAD) increased the performance of readability models by almost a 5% margin.
arXiv Detail & Related papers (2021-01-22T19:54:37Z)
- Constructing Taxonomies from Pretrained Language Models [52.53846972667636]
We present a method for constructing taxonomic trees (e.g., WordNet) using pretrained language models.
Our approach is composed of two modules, one that predicts parenthood relations and another that reconciles those predictions into trees.
We train our model on subtrees sampled from WordNet, and test on non-overlapping WordNet subtrees.
arXiv Detail & Related papers (2020-10-24T07:16:21Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.