Related papers: Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection

URL: http://arxiv.org/abs/2503.17739v2
Date: Tue, 10 Jun 2025 15:32:29 GMT
Title: Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection
Authors: Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe,
Abstract summary: Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback.<n>This paper leverages Large Language Models (LLMs) and Transformer models to generate synthetic Arabic essays for AES.<n>We create a dataset of 3,040 annotated essays with errors injected using our two methods.
Score: 10.198081881605226
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated Essay Scoring (AES) plays a crucial role in assessing language learners' writing quality, reducing grading workload, and providing real-time feedback. The lack of annotated essay datasets inhibits the development of Arabic AES systems. This paper leverages Large Language Models (LLMs) and Transformer models to generate synthetic Arabic essays for AES. We prompt an LLM to generate essays across the Common European Framework of Reference (CEFR) proficiency levels and introduce and compare two approaches to error injection. We create a dataset of 3,040 annotated essays with errors injected using our two methods. Additionally, we develop a BERT-based Arabic AES system calibrated to CEFR levels. Our experimental results demonstrate the effectiveness of our synthetic dataset in improving Arabic AES performance. We make our code and data publicly available.

Related papers

Qayyem: A Real-time Platform for Scoring Proficiency of Arabic Essays [5.404427910866254]
We present Qayyem, a Web-based platform designed to support Arabic AES.<n>Qayyem provides an integrated workflow for assignment creation, batch essay upload, scoring configuration, and per-trait essay evaluation.<n>The platform deploys a number of state-of-the-art Arabic essay scoring models with different effectiveness and efficiency figures.
arXiv Detail & Related papers (2026-03-01T09:26:47Z)
LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring [7.121813878009244]
LAILA is the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar.<n>We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings.
arXiv Detail & Related papers (2025-12-30T13:49:52Z)
Automatic Essay Scoring and Feedback Generation in Basque Language Learning [4.218073067465283]
This paper introduces the first publicly available dataset for Automatic Essay Scoring (AES) and feedback generation in Basque, targeting the CEFR C1 proficiency level.<n>The dataset comprises 3,200 essays from HABE, each annotated by expert evaluators with criterion specific scores covering correctness, richness, coherence, cohesion, and task alignment enriched with detailed feedback and error examples.<n>We fine-tune open-source models, including RoBERTa-EusCrawl and Latxa 8B/70B, for both scoring and explanation generation.
arXiv Detail & Related papers (2025-12-09T15:28:35Z)
KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [57.08591486199925]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track.<n>We develop both cascaded systems, and end-to-end (E2E) Speech Translation systems.<n>Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
Sadeed: Advancing Arabic Diacritization Through Small Language Model [0.0]
We introduce Sadeed, a novel decoder-only language model for Arabic diacritization. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. We introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels.
arXiv Detail & Related papers (2025-04-30T13:37:24Z)
Rank-Then-Score: Enhancing Large Language Models for Automated Essay Scoring [6.459215652021233]
We propose Rank-Then-Score (RTS), a fine-tuning framework based on large language models to enhance their essay scoring capabilities.<n> Experimental results on two benchmark datasets, HSK and ASAP, demonstrate that RTS consistently outperforms the direct prompting (Vanilla) method in terms of average QWK.
arXiv Detail & Related papers (2025-04-08T07:10:51Z)
How well can LLMs Grade Essays in Arabic? [3.101490720236325]
This research assesses the effectiveness of large language models (LLMs) in the task of Arabic automated essay scoring (AES) using the AR-AES dataset.<n>It explores various evaluation methodologies, including zero-shot, few-shot in-context learning, and fine-tuning.<n>A mixed-language prompting strategy, integrating English prompts with Arabic content, was implemented to improve model comprehension and performance.
arXiv Detail & Related papers (2025-01-27T21:30:02Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
CATT: Character-based Arabic Tashkeel Transformer [0.0]
Tashkeel, or Arabic Text Diacritization, greatly enhances the comprehension of Arabic text. This paper introduces a new approach to training ATD models. We evaluate our models alongside 11 commercial and open-source models.
arXiv Detail & Related papers (2024-07-03T16:05:20Z)
CELA: Cost-Efficient Language Model Alignment for CTR Prediction [70.65910069412944]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.<n>Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)<n>We propose textbfCost-textbfEfficient textbfLanguage Model textbfAlignment (textbfCELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
An Empirical Study of Automatic Post-Editing [56.86393786396992]
APE aims to reduce manual post-editing efforts by automatically correcting errors in machine-translated output. To alleviate the lack of genuine training data, most of the current APE systems employ data augmentation methods to generate large-scale artificial corpora. We study the outputs of the state-of-art APE model on a difficult APE dataset to analyze the problems in existing APE systems.
arXiv Detail & Related papers (2022-09-16T07:38:27Z)
Revisiting Pre-trained Language Models and their Evaluation for Arabic Natural Language Understanding [44.048072667378115]
Existing Arabic PLMs are not well-explored and their pre-trainig can be improved significantly. There is a lack of systematic and reproducible evaluation of these models in the literature. We show that our models significantly outperform existing Arabic PLMs and achieve a new state-of-the-art performance on discriminative and generative Arabic NLU and NLG tasks.
arXiv Detail & Related papers (2022-05-21T22:38:19Z)
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models. Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models. We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics. We find that AES models are highly overstable. Even heavy modifications(as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.