Automatic Difficulty Classification of Arabic Sentences
- URL: http://arxiv.org/abs/2103.04386v1
- Date: Sun, 7 Mar 2021 16:02:04 GMT
- Title: Automatic Difficulty Classification of Arabic Sentences
- Authors: Nouran Khallaf, Serge Sharoff
- Abstract summary: The 3-way CEFR classification reaches an F-1 of 0.80 with Arabic-BERT and 0.75 with XLM-R, and regression reaches a Spearman correlation of 0.71.
We compare the use of sentence embeddings of different kinds (fastText, mBERT, XLM-R and Arabic-BERT) as well as traditional language features such as POS tags, dependency trees, readability scores and frequency lists for language learners.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a Modern Standard Arabic (MSA) sentence
difficulty classifier, which predicts the difficulty of sentences for language
learners using either the CEFR proficiency levels or a binary classification as
simple or complex. We compare the use of sentence embeddings of different kinds
(fastText, mBERT, XLM-R and Arabic-BERT), as well as traditional language
features such as POS tags, dependency trees, readability scores and frequency
lists for language learners. Our best results have been achieved using
fine-tuned Arabic-BERT. Our 3-way CEFR classification reaches an F-1 of 0.80
with Arabic-BERT and 0.75 with XLM-R, and a Spearman correlation of 0.71 for
regression. Our binary difficulty classifier reaches an F-1 of 0.94, and the
sentence-pair semantic similarity classifier reaches an F-1 of 0.98.
Related papers
- Ta'keed: The First Generative Fact-Checking System for Arabic Claims [0.0]
This paper introduces Ta'keed, an explainable Arabic automatic fact-checking system.
Ta'keed generates explanations for claim credibility, particularly in Arabic.
The system achieved a promising F1 score of 0.72 in the classification task.
arXiv Detail & Related papers (2024-01-25T10:43:00Z)
- Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation to gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z)
- Language Model Classifier Aligns Better with Physician Word Sensitivity than XGBoost on Readmission Prediction [86.15787587540132]
We introduce sensitivity score, a metric that scrutinizes models' behaviors at the vocabulary level.
Our experiments compare the decision-making logic of clinicians and classifiers based on rank correlations of sensitivity scores.
arXiv Detail & Related papers (2022-11-13T23:59:11Z)
- Multilevel sentiment analysis in arabic [1.4467794332678539]
The average F-score achieved in the term level SA for both positive and negative testing classes is 0.92.
In the document level SA, the average F-score for positive testing classes is 0.94, while for negative classes is 0.93.
arXiv Detail & Related papers (2022-05-24T19:16:06Z)
- Towards Arabic Sentence Simplification via Classification and Generative Approaches [0.0]
This paper presents an attempt to build a Modern Standard Arabic (MSA) sentence-level simplification system.
We experimented with sentence simplification using two approaches: (i) a classification approach leading to lexical simplification pipelines which use Arabic-BERT, a pre-trained contextualised model, as well as a model of fastText word embeddings; and (ii) a generative approach, a Seq2Seq technique by applying a multilingual Text-to-Text Transfer Transformer mT5.
arXiv Detail & Related papers (2022-04-20T08:17:33Z)
- VALUE: Understanding Dialect Disparity in NLU [50.35526025326337]
We construct rules for 11 features of African American Vernacular English (AAVE)
We recruit fluent AAVE speakers to validate each feature transformation via linguistic acceptability judgments.
Experiments show that these new dialectal features can lead to a drop in model performance.
arXiv Detail & Related papers (2022-04-06T18:30:56Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
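WER figures like the 14.55% to 10.36% reduction above follow the standard definition: word-level edit distance divided by reference length. A minimal sketch of that definition (not the paper's own scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance (substitutions +
    insertions + deletions) divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion
```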
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Language Identification with a Reciprocal Rank Classifier [1.4467794332678539]
We present a lightweight and effective language identifier that is robust to changes of domain and to the absence of training data.
We test this on two 22-language data sets and demonstrate zero-effort domain adaptation from a Wikipedia training set to a Twitter test set.
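One plausible reading of a reciprocal-rank classifier for language identification (an assumption; the paper's exact scoring may differ): score each candidate language by summing 1/rank of each input word in that language's frequency-ranked word list, then pick the highest score. The tiny frequency lists below are illustrative only.

```python
# Hypothetical per-language frequency lists, most frequent first (illustrative only).
FREQ_LISTS = {
    "en": ["the", "of", "and", "to", "in"],
    "es": ["de", "la", "que", "el", "en"],
}

def identify(text: str) -> str:
    """Score each language by the sum of reciprocal ranks of the text's words
    in that language's frequency list; unknown words contribute nothing."""
    words = text.lower().split()
    scores = {}
    for lang, freq_list in FREQ_LISTS.items():
        rank = {w: i + 1 for i, w in enumerate(freq_list)}  # 1-based ranks
        scores[lang] = sum(1.0 / rank[w] for w in words if w in rank)
    return max(scores, key=scores.get)

print(identify("the history of Spain"))  # -> en (by this toy list)
```

Because the score needs only a ranked word list per language, no labeled in-domain training data, this kind of scheme is naturally robust to domain shift, consistent with the Wikipedia-to-Twitter transfer reported above.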
arXiv Detail & Related papers (2021-09-20T22:10:07Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- NLP-CIC at SemEval-2020 Task 9: Analysing sentiment in code-switching language using a simple deep-learning classifier [63.137661897716555]
Code-switching is a phenomenon in which two or more languages are used in the same message.
We use a standard convolutional neural network model to predict the sentiment of tweets in a blend of Spanish and English languages.
arXiv Detail & Related papers (2020-09-07T19:57:09Z)
- Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL [0.0]
The goal of this work is to build a classifier that can identify text complexity within the context of teaching reading to English as a Second Language (ESL) learners.
Using a corpus of 6171 texts, which had already been classified into three different levels of difficulty by ESL experts, different experiments were conducted with five machine learning algorithms.
The results showed that the adopted linguistic features provide a good overall classification performance.
arXiv Detail & Related papers (2020-01-07T02:42:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.