Strategies for Arabic Readability Modeling
- URL: http://arxiv.org/abs/2407.03032v1
- Date: Wed, 3 Jul 2024 11:54:11 GMT
- Title: Strategies for Arabic Readability Modeling
- Authors: Juan Piñeros Liberato, Bashar Alhafni, Muhamed Al Khalil, Nizar Habash
- Abstract summary: Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility.
We present a set of experimental results on Arabic readability assessment using a diverse range of approaches.
- Score: 9.976720880041688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic's morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.
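The abstract reports results as macro F1 scores at the word and fragment levels. As a reference point, here is a minimal pure-Python sketch (not the paper's code) of how macro F1 is computed; the gold and predicted readability labels are illustrative, not the paper's data.

```python
# Macro F1: compute per-class F1, then average with equal weight per class.
def macro_f1(gold, pred):
    classes = sorted(set(gold) | set(pred))
    scores = []
    for c in classes:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Macro averaging weights every class equally, regardless of frequency.
    return sum(scores) / len(scores)

gold = ["easy", "medium", "hard", "easy", "hard"]
pred = ["easy", "medium", "easy", "easy", "hard"]
print(round(macro_f1(gold, pred) * 100, 1))  # → 82.2
```

Unlike accuracy, macro averaging keeps rare readability levels from being drowned out by frequent ones, which is why it is the standard choice for imbalanced multi-class label sets.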
Related papers
- Guidelines for Fine-grained Sentence-level Arabic Readability Annotation [9.261022921574318]
The Balanced Arabic Readability Evaluation Corpus (BAREC) project is designed to address the need for comprehensive Arabic language resources aligned with diverse readability levels.
Inspired by the Taha/Arabi21 readability reference, BAREC aims to provide a standardized reference for assessing sentence-level Arabic text readability across 19 distinct levels.
This paper focuses on our meticulous annotation guidelines, demonstrated through the analysis of 10,631 sentences/phrases (113,651 words).
arXiv Detail & Related papers (2024-10-11T09:59:46Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectic datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language.
The proposed framework can explain specific predictions by training a local surrogate explainable model.
We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text [4.863310073296471]
We propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions.
We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking.
arXiv Detail & Related papers (2023-06-06T10:18:17Z) - ORCA: A Challenging Benchmark for Arabic Language Understanding [8.9379057739817]
ORCA is a publicly available benchmark for Arabic language understanding evaluation.
To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models.
arXiv Detail & Related papers (2022-12-21T04:35:43Z) - A Transfer Learning Based Model for Text Readability Assessment in German [4.550811027560416]
We propose a new model for text complexity assessment for German text based on transfer learning.
The best model, based on the BERT pretrained language model, achieved a Root Mean Square Error (RMSE) of 0.483.
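The RMSE reported above is the square root of the mean squared difference between gold and predicted readability scores. A minimal sketch with illustrative values (not the paper's data):

```python
import math

def rmse(gold, pred):
    # Square each error, average, then take the square root.
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold))

# Illustrative gold vs. predicted complexity scores.
print(rmse([3.0, 4.5, 2.0], [3.5, 4.0, 2.0]))  # ≈ 0.408
```

Because errors are squared before averaging, RMSE penalizes large deviations more heavily than mean absolute error, which suits a graded readability scale.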
arXiv Detail & Related papers (2022-07-13T15:15:44Z) - Efficient Measuring of Readability to Improve Documents Accessibility for Arabic Language Learners [0.0]
The approach is based on machine learning classification methods to discriminate between different levels of difficulty in reading and understanding a text.
Several models were trained on a large corpus mined from online Arabic websites and manually annotated.
Best results were achieved using TF-IDF vectors built from word-based unigrams and bigrams, with an overall accuracy of 87.14% across four complexity classes.
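A minimal sketch of such a TF-IDF pipeline, assuming scikit-learn; the toy texts and labels below are illustrative (two classes rather than the paper's four, and nothing here is the paper's actual corpus or code).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative training texts with hypothetical difficulty labels.
texts = [
    "the cat sat on the mat",
    "quantum entanglement defies classical locality",
    "dogs run fast in the park",
    "stochastic gradient descent optimizes non-convex objectives",
]
levels = ["easy", "hard", "easy", "hard"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # word-based unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, levels)
print(clf.predict(["the dog sat on the mat"])[0])  # likely "easy" given shared vocabulary
```

The `ngram_range=(1, 2)` setting mirrors the unigram-plus-bigram feature combination the abstract credits for the best results; the classifier choice here is an assumption, as the abstract does not name one.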
arXiv Detail & Related papers (2021-09-09T10:05:38Z) - Deep Learning Based Text Classification: A Comprehensive Review [75.8403533775179]
We provide a review of more than 150 deep learning based models for text classification developed in recent years.
We also provide a summary of more than 40 popular datasets widely used for text classification.
arXiv Detail & Related papers (2020-04-06T02:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.