Application of Lexical Features Towards Improvement of Filipino
Readability Identification of Children's Literature
- URL: http://arxiv.org/abs/2101.10537v1
- Date: Fri, 22 Jan 2021 19:54:37 GMT
- Authors: Joseph Marvin Imperial, Ethel Ong
- Abstract summary: We explore the use of lexical features towards improving readability identification of children's books written in Filipino.
Results show that combining lexical features (LEX), consisting of type-token ratio, lexical density, lexical variation, and foreign word count, with traditional features (TRAD) increased the performance of readability models by almost a 5% margin.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proper identification of grade levels of children's reading materials is an
important step towards effective learning. Recent studies in readability
assessment for the English domain applied modern approaches in natural language
processing (NLP) such as machine learning (ML) techniques to automate the
process. There is also a need to extract the correct linguistic features when
modeling readability formulas. In the context of the Filipino language, limited
work has been done [1, 2], especially in considering the language's lexical
complexity as a main feature. In this paper, we explore the use of lexical
features towards improving readability identification of children's books
written in Filipino. Results show that combining lexical features (LEX),
consisting of type-token ratio, lexical density, lexical variation, and
foreign word count, with traditional features (TRAD) used by previous works,
such as sentence length, average syllable length, polysyllabic words, and
word, sentence, and phrase counts, increased the performance of readability
models by almost a 5% margin (from 42% to 47.2%). Further analysis and
ranking of the most important features identified which features contribute
the most to reading complexity.
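
As a rough, purely illustrative sketch of how such features could be extracted and combined in Python: the regex tokenizer, the vowel-counting syllable proxy, the externally supplied foreign-word and content-word lexicons, and the Random Forest classifier are all assumptions, and only a subset of the TRAD features is shown; this is not the authors' implementation.

import re

from sklearn.ensemble import RandomForestClassifier

def trad_features(text: str) -> dict:
    """Traditional (TRAD) surface features: counts and average lengths."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z]+", text.lower())
    # Crude proxy: count vowel letters as syllables (at least 1 per word).
    syllables = [max(1, len(re.findall(r"[aeiou]", w))) for w in words]
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / max(1, len(sentences)),
        "avg_syllables_per_word": sum(syllables) / max(1, len(words)),
        "polysyllabic_words": sum(1 for s in syllables if s >= 3),
    }

def lex_features(text: str, foreign_lexicon: set, content_words: set) -> dict:
    """Lexical (LEX) features: TTR, density, variation, foreign word count."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    content = [w for w in words if w in content_words]
    return {
        "type_token_ratio": len(set(words)) / max(1, len(words)),
        "lexical_density": len(content) / max(1, len(words)),
        "lexical_variation": len(set(content)) / max(1, len(content)),
        "foreign_word_count": sum(1 for w in words if w in foreign_lexicon),
    }

def train(corpus, foreign_lexicon, content_words):
    """Fit a grade-level classifier on (text, grade) pairs and rank features."""
    names, X, y = None, [], []
    for text, grade in corpus:
        feats = {**trad_features(text),
                 **lex_features(text, foreign_lexicon, content_words)}
        names = sorted(feats)
        X.append([feats[n] for n in names])
        y.append(grade)
    model = RandomForestClassifier(random_state=0).fit(X, y)
    # Rank features by importance, mirroring the abstract's analysis step.
    ranking = sorted(zip(names, model.feature_importances_),
                     key=lambda p: -p[1])
    return model, ranking

The ranking step mirrors the feature-importance analysis described above; in the paper, adding LEX to TRAD raised model performance from 42% to 47.2%.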
Related papers
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both; a sketch of these variants follows this entry.
Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
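
As a loose illustration of these three template variants, the sketch below builds all three prompts for one input; the prompt wording and the use of unidecode as a generic romanizer are assumptions, not the paper's actual templates or transliteration tool.

from unidecode import unidecode  # generic Unicode-to-Latin romanization

# Build the three prompt variants: (1) original script, (2) Latin script,
# (3) both. The wording here is illustrative only.
def build_prompts(text: str, instruction: str) -> dict:
    latin = unidecode(text)
    return {
        "original": f"{instruction}\nText: {text}",
        "latin": f"{instruction}\nText (romanized): {latin}",
        "both": f"{instruction}\nText: {text}\nRomanized: {latin}",
    }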
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap between languages with high mutual intelligibility, significantly improves the performance of ARA models (a rough sketch of the n-gram-overlap idea follows this entry).
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
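
CrossNGO itself is defined in that paper; as a rough stand-in for the underlying idea, this sketch computes a Jaccard-style character n-gram overlap between two vocabularies (the Jaccard formulation and function names are illustrative assumptions, not the paper's definition).

def char_ngrams(words, n=3):
    """Collect the set of character n-grams over a list of words."""
    grams = set()
    for w in words:
        padded = f"#{w}#"  # word-boundary markers
        grams.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def ngram_overlap(words_a, words_b, n=3):
    """Jaccard overlap of the character n-gram inventories of two vocabularies."""
    a, b = char_ngrams(words_a, n), char_ngrams(words_b, n)
    return len(a & b) / max(1, len(a | b))

# Closely related languages (e.g., Tagalog "bahay"/"araw" vs. Cebuano
# "balay"/"adlaw") should score higher than unrelated pairs.
print(ngram_overlap(["bahay", "araw"], ["balay", "adlaw"]))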
- A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives [0.34998703934432673]
We analyze two Natural Language Inference data sets with respect to their linguistic features.
The goal is to identify those syntactic and semantic properties that are particularly hard to comprehend for a machine learning model.
arXiv Detail & Related papers (2022-10-19T10:06:03Z)
- Unravelling Interlanguage Facts via Explainable Machine Learning [10.71581852108984]
We focus on the internals of an NLI classifier trained by an explainable machine learning algorithm.
We use this perspective in order to tackle both NLI and a companion task, guessing whether a text has been written by a native or a non-native speaker.
We investigate which kinds of linguistic traits are most effective for solving our two tasks, namely, which are most indicative of a speaker's L1.
arXiv Detail & Related papers (2022-08-02T14:05:15Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual sentence encoders for the amount of cross-lingual lexical knowledge stored in their parameters, and compare them against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Diverse Linguistic Features for Assessing Reading Difficulty of Educational Filipino Texts [0.0]
This paper describes the development of automatic machine learning-based readability assessment models for educational Filipino texts.
Results show that a Random Forest model obtained a high performance of 62.7% in terms of accuracy.
arXiv Detail & Related papers (2021-07-31T13:59:46Z)
- Improving Cross-Lingual Reading Comprehension with Self-Training [62.73937175625953]
Current state-of-the-art models even surpass human performance on several benchmarks.
Previous works have revealed the abilities of pre-trained multilingual models for zero-shot cross-lingual reading comprehension.
This paper further utilizes unlabeled data via self-training to improve performance.
arXiv Detail & Related papers (2021-05-08T08:04:30Z)
- Morphologically Aware Word-Level Translation [82.59379608647147]
We propose a novel morphologically aware probability model for bilingual lexicon induction.
Our model exploits the basic linguistic intuition that the lexeme is the key lexical unit of meaning.
arXiv Detail & Related papers (2020-11-15T17:54:49Z)
- Probing Pretrained Language Models for Lexical Semantics [76.73599166020307]
We present a systematic empirical analysis across six typologically diverse languages and five different lexical tasks.
Our results indicate patterns and best practices that hold universally, but also point to prominent variations across languages and tasks.
arXiv Detail & Related papers (2020-10-12T14:24:01Z)
- Linguistic Features for Readability Assessment [0.0]
It is unknown whether augmenting deep learning models with linguistically motivated features would improve performance further.
We find that, given sufficient training data, augmenting deep learning models with linguistically motivated features does not improve state-of-the-art performance.
Our results provide preliminary evidence for the hypothesis that the state-of-the-art deep learning models represent linguistic features of the text related to readability.
arXiv Detail & Related papers (2020-05-30T22:14:46Z)