Data Augmentation using Transformers and Similarity Measures for
Improving Arabic Text Classification
- URL: http://arxiv.org/abs/2212.13939v3
- Date: Sun, 20 Aug 2023 15:32:29 GMT
- Authors: Dania Refai, Saleh Abo-Soud, Mohammad Abdel-Rahman
- Abstract summary: We propose a new Arabic DA method that employs AraGPT-2, a recent and powerful generative language model.
The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances.
- The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of learning models heavily relies on the availability and
adequacy of training data. To address the dataset adequacy issue, researchers
have extensively explored data augmentation (DA) as a promising approach. DA
generates new data instances through transformations applied to the available
data, thereby increasing dataset size and variability. This approach has
enhanced model performance and accuracy, particularly in addressing class
imbalance problems in classification tasks. However, few studies have explored
DA for Arabic, and those that have rely on traditional approaches such as
paraphrasing or noising-based techniques. In this paper, we propose a new
Arabic DA method that employs AraGPT-2, a recent and powerful generative
language model, for the augmentation process. The generated sentences are
evaluated in terms of context, semantics, diversity, and novelty using the
Euclidean, cosine, Jaccard, and BLEU distances. Finally, the AraBERT
transformer is used on sentiment classification tasks to evaluate the
classification performance of the augmented Arabic dataset. The experiments
were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and
MOVIE. The selected datasets vary in size, number of labels, and degree of
class imbalance. The results show that the proposed methodology improved Arabic
sentiment text classification on all datasets, increasing the F1 score by
4% on AraSarcasm, 6% on ASTD, 9% on ATT, and 13% on MOVIE.
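To make the method concrete, the sketch below pairs AraGPT-2 generation with the four similarity measures named in the abstract. It is a minimal illustration, not the authors' released code: the aubmindlab/aragpt2-base checkpoint, the sampling settings, and the TF-IDF sentence representation behind the Euclidean and cosine scores are all assumptions.

```python
# Hedged sketch of the paper's pipeline: generate candidates with AraGPT-2,
# then score each candidate against its seed sentence. Checkpoint name and
# generation settings are assumptions, not the authors' configuration.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from transformers import pipeline

generator = pipeline("text-generation", model="aubmindlab/aragpt2-base")

def augment(seed: str, n: int = 3) -> list[str]:
    # Sample n candidate sentences conditioned on the seed sentence.
    outputs = generator(seed, max_new_tokens=30, do_sample=True,
                        top_p=0.95, num_return_sequences=n)
    return [o["generated_text"] for o in outputs]

def score(seed: str, candidate: str) -> dict:
    # Proxies for the context/semantics/diversity/novelty checks in the
    # abstract, computed over TF-IDF vectors and surface tokens.
    vectorizer = TfidfVectorizer().fit([seed, candidate])
    s, c = vectorizer.transform([seed, candidate]).toarray()
    seed_toks, cand_toks = set(seed.split()), set(candidate.split())
    return {
        "euclidean": float(euclidean_distances([s], [c])[0, 0]),
        "cosine": float(cosine_similarity([s], [c])[0, 0]),
        "jaccard": len(seed_toks & cand_toks) / len(seed_toks | cand_toks),
        "bleu": sentence_bleu([seed.split()], candidate.split(),
                              smoothing_function=SmoothingFunction().method1),
    }
```

Candidates scored this way could then be thresholded before fine-tuning an AraBERT classifier on the augmented set (e.g., the aubmindlab/bert-base-arabertv02 checkpoint, again an assumption rather than the paper's stated choice).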
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation [8.777277201807351]
We present SaSPA: Structure and Subject Preserving Augmentation.
Our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity.
We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods.
arXiv Detail & Related papers (2024-06-20T17:58:30Z)
- AugGPT: Leveraging ChatGPT for Text Data Augmentation [59.76140039943385]
We propose a text data augmentation approach based on ChatGPT (named AugGPT).
AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples.
Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach.
arXiv Detail & Related papers (2023-02-25T06:58:16Z)
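As a hedged illustration of the rephrasing step described in the AugGPT entry above, the sketch below asks a chat model to paraphrase one training sentence several times. The prompt wording and the gpt-3.5-turbo model name are assumptions; the paper's exact prompts are not reproduced here.

```python
# Hedged sketch of AugGPT-style augmentation: rephrase one training
# sentence into several variants via a chat model. Prompt and model
# name are assumptions, not the paper's configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rephrase(sentence: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed; any chat model works for the sketch
        messages=[{
            "role": "user",
            "content": f"Rephrase the following sentence {n} times, one per "
                       f"line, keeping the meaning intact:\n{sentence}",
        }],
    )
    return response.choices[0].message.content.splitlines()[:n]
```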
- DAGAM: Data Augmentation with Generation And Modification [3.063234089519162]
Under-fitting often occurs in pre-trained language models because the models are very large.
We introduce three data augmentation schemes that help reduce underfitting problems of large-scale language models.
arXiv Detail & Related papers (2022-04-06T07:20:45Z)
- Empirical evaluation of shallow and deep learning classifiers for Arabic sentiment analysis [1.1172382217477126]
This work presents a detailed comparison of the performance of deep learning models for sentiment analysis of Arabic reviews.
The datasets used in this study are multi-dialect Arabic hotel and book review datasets, which are some of the largest publicly available datasets for Arabic reviews.
Results showed that deep learning outperforms shallow learning for binary and multi-label classification, in contrast with the results of similar work reported in the literature.
arXiv Detail & Related papers (2021-12-01T14:45:43Z)
- Guiding Generative Language Models for Data Augmentation in Few-Shot Text Classification [59.698811329287174]
We leverage GPT-2 for generating artificial training instances in order to improve classification performance.
Our results show that fine-tuning GPT-2 on a handful of labeled instances leads to consistent classification improvements.
arXiv Detail & Related papers (2021-11-17T12:10:03Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Data Augmentation Approaches in Natural Language Processing: A Survey [28.91744006146676]
Data augmentation (DA) alleviates data scarcity in scenarios where deep learning techniques may fail.
One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data.
We frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling.
arXiv Detail & Related papers (2021-10-05T07:35:32Z)
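The survey's paraphrasing/noising/sampling taxonomy is easy to make concrete. Below is a minimal noising-style augmenter, a hedged illustration in the spirit of easy-data-augmentation techniques rather than code from the survey; the swap and deletion rates are arbitrary.

```python
# Hedged illustration of the "noising" category: perturb a sentence with
# random token swaps and deletions. Rates are arbitrary choices.
import random

def noise(sentence: str, p_delete: float = 0.1, n_swaps: int = 1) -> str:
    tokens = sentence.split()
    for _ in range(n_swaps):  # random swap of two token positions
        if len(tokens) > 1:
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
    # Random deletion, falling back to the original if everything is dropped.
    kept = [t for t in tokens if random.random() > p_delete] or tokens
    return " ".join(kept)
```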
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- Zero-Resource Multi-Dialectal Arabic Natural Language Understanding [0.0]
We investigate the zero-shot performance on Dialectal Arabic (DA) when fine-tuning a pre-trained language model on modern standard Arabic (MSA) data only.
We propose self-training with unlabeled DA data and apply it in the context of named entity recognition (NER), part-of-speech (POS) tagging, and sarcasm detection (SRD).
Our results demonstrate the effectiveness of self-training with unlabeled DA data.
arXiv Detail & Related papers (2021-04-14T02:29:27Z)
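The two entries above share a self-training recipe: train on labeled data (e.g., MSA), pseudo-label unlabeled data (e.g., dialectal Arabic), keep confident predictions, and retrain. A minimal sketch follows, assuming a TF-IDF plus logistic-regression classifier and a fixed confidence threshold in place of the pre-trained language models the papers actually use.

```python
# Hedged self-training sketch: pseudo-label unlabeled text and fold
# confident predictions back into the labeled pool. Classifier and
# threshold are simplifying assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=3):
    X_lab, y_lab, X_unlab = list(X_lab), list(y_lab), list(X_unlab)
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if not X_unlab:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break
        # Move confidently pseudo-labeled examples into the labeled pool.
        X_lab += [x for x, keep in zip(X_unlab, confident) if keep]
        y_lab += list(model.classes_[proba[confident].argmax(axis=1)])
        X_unlab = [x for x, keep in zip(X_unlab, confident) if not keep]
    return model
```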
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
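A hedged reading of the "sentence-specific probability for word selection" idea: words deeper in the dependency tree are less central, so they can be perturbed with higher probability. The sketch below implements a word-dropout variant of that idea; the depth-to-probability mapping and the spaCy en_core_web_sm model are assumptions, not the paper's formulation.

```python
# Hedged sketch of syntax-aware word dropout: tokens farther from the
# dependency root are dropped with higher probability. The mapping from
# depth to probability is an arbitrary choice.
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed installed via `python -m spacy download en_core_web_sm`

def depth(token) -> int:
    """Number of hops from the token to the root of its dependency tree."""
    d = 0
    while token.head is not token:  # spaCy roots are their own head
        token = token.head
        d += 1
    return d

def syntax_aware_dropout(sentence: str, scale: float = 0.1) -> str:
    doc = nlp(sentence)
    kept = [t.text for t in doc if random.random() > scale * depth(t)]
    return " ".join(kept or [t.text for t in doc])
```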