Efficient Measuring of Readability to Improve Documents Accessibility
for Arabic Language Learners
- URL: http://arxiv.org/abs/2109.08648v1
- Date: Thu, 9 Sep 2021 10:05:38 GMT
- Title: Efficient Measuring of Readability to Improve Documents Accessibility
for Arabic Language Learners
- Authors: Sadik Bessou, Ghozlane Chenni
- Abstract summary: The approach is based on machine learning classification methods to discriminate between different levels of difficulty in reading and understanding a text.
Several models were trained on a large corpus mined from online Arabic websites and manually annotated.
Best results were achieved using TF-IDF vectors built from a combination of word-based unigrams and bigrams, with an overall accuracy of 87.14% over four classes of complexity.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an approach based on supervised machine learning methods
to build a classifier that can identify text complexity in order to present
Arabic language learners with texts suitable to their levels. The approach is
based on machine learning classification methods to discriminate between the
different levels of difficulty in reading and understanding a text. Several
models were trained on a large corpus mined from online Arabic websites and
manually annotated. The models use both count and TF-IDF representations and
apply five machine learning algorithms: Multinomial Naive Bayes, Bernoulli
Naive Bayes, Logistic Regression, Support Vector Machine, and Random Forest,
with unigram and bigram features. As is common for this problem, text
complexity is captured by formulating level identification as a classification
task. Experimental results showed that
n-gram features could be indicative of the reading level of a text and could
substantially improve performance, and showed that SVM and Multinomial Naive
Bayes are the most accurate in predicting the complexity level. Best results
were achieved using TF-IDF vectors built from a combination of word-based
unigrams and bigrams, with an overall accuracy of 87.14% over four classes of
complexity.
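As a rough illustration of the setup the abstract describes, the sketch below assumes scikit-learn; `texts` and `levels` are placeholders standing in for the annotated Arabic corpus and its four complexity labels, not the authors' data or code.

```python
# A minimal sketch, not the authors' implementation: scikit-learn is assumed,
# and `texts` / `levels` are placeholders for the annotated Arabic corpus.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["نص سهل جدا", "نص متوسط الصعوبة", "نص صعب نسبيا", "نص صعب جدا"]  # placeholder documents
levels = [0, 1, 2, 3]                                                      # placeholder complexity labels

X_train, X_test, y_train, y_test = train_test_split(texts, levels, test_size=0.25, random_state=42)

classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

for name, clf in classifiers.items():
    # Word-level unigrams + bigrams with TF-IDF weighting, as in the abstract.
    pipeline = make_pipeline(TfidfVectorizer(analyzer="word", ngram_range=(1, 2)), clf)
    pipeline.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipeline.predict(X_test)))
```

LinearSVC stands in here for the SVM the abstract mentions; other kernels, count vectors, or vectorizer settings can be swapped in the same way.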
Related papers
- Strategies for Arabic Readability Modeling [9.976720880041688]
Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility.
We present a set of experimental results on Arabic readability assessment using a diverse range of approaches.
arXiv Detail & Related papers (2024-07-03T11:54:11Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
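As a loose illustration of the prompt-based classification this entry describes, the sketch below builds an instruction prompt and maps the model's reply to a label; `complete` and the label set are hypothetical stand-ins, not an API or configuration from the paper.

```python
# Hypothetical sketch of prompt-based (in-context) text classification.
# `complete` is a placeholder for any instruction-tuned LLM client call.
LABELS = ["beginner", "intermediate", "advanced", "expert"]  # assumed label set

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def classify_with_prompt(text: str) -> str:
    label_list = ", ".join(LABELS)
    prompt = (
        f"Classify the reading difficulty of the following text as one of: {label_list}.\n\n"
        f"Text: {text}\n\n"
        "Answer with the label only."
    )
    answer = complete(prompt).strip().lower()
    # Fall back when the model replies outside the label set.
    return answer if answer in LABELS else "unknown"
```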
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
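As a rough sketch of using a pre-trained DistilBERT encoder for sequence classification, as the entry above mentions: the multilingual checkpoint and label count below are assumptions, not the paper's configuration, and the classification head is untrained until fine-tuned.

```python
# Sketch only: a multilingual DistilBERT checkpoint with an untrained
# classification head; fine-tune on labeled data before relying on outputs.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-multilingual-cased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

inputs = tokenizer("Example sentence to classify.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(int(logits.argmax(dim=-1)))  # predicted class index
```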
- Tuning Traditional Language Processing Approaches for Pashto Text Classification [0.0]
The main aim of this study is to establish a Pashto automatic text classification system.
This study compares several models containing both statistical and neural network machine learning techniques.
This research obtained an average testing accuracy rate of 94% using a classification algorithm with the TF-IDF feature extraction method.
arXiv Detail & Related papers (2023-05-04T22:57:45Z)
- A Unified Understanding of Deep NLP Models for Text Classification [88.35418976241057]
We have developed a visual analysis tool, DeepNLPVis, to enable a unified understanding of NLP models for text classification.
The key idea is a mutual information-based measure, which provides quantitative explanations on how each layer of a model maintains the information of input words in a sample.
A multi-level visualization, which consists of a corpus-level, a sample-level, and a word-level visualization, supports the analysis from the overall training set to individual samples.
arXiv Detail & Related papers (2022-06-19T08:55:07Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Classification of Chinese Handwritten Numbers with Labeled Projective Dictionary Pair Learning [1.8594711725515674]
We design class-specific dictionaries incorporating three factors: discriminability, sparsity and classification error.
We adopt a new feature space, i.e., histogram of oriented gradients (HOG), to generate the dictionary atoms.
Results demonstrated enhanced classification performance (~98%) compared to state-of-the-art deep learning techniques.
arXiv Detail & Related papers (2020-03-26T01:43:59Z)
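As a small, hypothetical illustration of the HOG feature space mentioned in the entry above (using scikit-image on a placeholder image, not the paper's dictionary-learning code):

```python
# Extract a HOG descriptor from a placeholder 28x28 image with scikit-image;
# the resulting fixed-length vector could serve as input to dictionary learning.
import numpy as np
from skimage.feature import hog

image = np.random.rand(28, 28)  # stand-in for a handwritten-digit image
descriptor = hog(
    image,
    orientations=9,
    pixels_per_cell=(7, 7),
    cells_per_block=(2, 2),
    feature_vector=True,
)
print(descriptor.shape)
```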
- Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL [0.0]
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners.
Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms.
The results showed that the adopted linguistic features provide a good overall classification performance.
arXiv Detail & Related papers (2020-01-07T02:42:57Z)
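As a hedged sketch in the spirit of the linguistic-feature approach described in the entry above, the toy features below (sentence length, word length, type-token ratio) are illustrative stand-ins, not the paper's feature set:

```python
# Toy surface-level linguistic features that could feed a complexity classifier;
# these are illustrative placeholders, not the paper's feature set.
def linguistic_features(text: str) -> dict:
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    words = text.split()
    return {
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio": len(set(w.lower() for w in words)) / max(len(words), 1),
    }

print(linguistic_features("Short sentence. A slightly longer second sentence follows here."))
```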