AraDIC: Arabic Document Classification using Image-Based Character
Embeddings and Class-Balanced Loss
- URL: http://arxiv.org/abs/2006.11586v1
- Date: Sat, 20 Jun 2020 14:25:06 GMT
- Title: AraDIC: Arabic Document Classification using Image-Based Character
Embeddings and Class-Balanced Loss
- Authors: Mahmoud Daif, Shunsuke Kitada, Hitoshi Iyatomi
- Abstract summary: We propose a novel end-to-end Arabic document classification framework, Arabic document image-based classifier (AraDIC)
AraDIC consists of an image-based character encoder and a classifier. They are trained in an end-to-end fashion using the class balanced loss to deal with the long-tailed data distribution problem.
To the best of our knowledge, this is the first image-based character embedding framework addressing the problem of Arabic text classification.
- Score: 7.734726150561088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classical and some deep learning techniques for Arabic text classification
often depend on complex morphological analysis, word segmentation, and
hand-crafted feature engineering. These could be eliminated by using
character-level features. We propose a novel end-to-end Arabic document
classification framework, Arabic document image-based classifier (AraDIC),
inspired by the work on image-based character embeddings. AraDIC consists of an
image-based character encoder and a classifier. They are trained in an
end-to-end fashion using the class balanced loss to deal with the long-tailed
data distribution problem. To evaluate the effectiveness of AraDIC, we created
and published two datasets, the Arabic Wikipedia title (AWT) dataset and the
Arabic poetry (AraP) dataset. To the best of our knowledge, this is the first
image-based character embedding framework addressing the problem of Arabic text
classification. We also present the first deep learning-based text classifier
widely evaluated on modern standard Arabic, colloquial Arabic and classical
Arabic. AraDIC shows performance improvement over classical and deep learning
baselines by 12.29% and 23.05% for the micro and macro F-score, respectively.
Related papers
- ATHAR: A High-Quality and Diverse Dataset for Classical Arabic to English Translation [1.8109081066789847]
Classical Arabic represents a significant era, encompassing the golden age of Arab culture, philosophy, and scientific literature.
We have identified a scarcity of translation datasets in Classical Arabic, which are often limited in scope and topics.
We present the ATHAR dataset, comprising 66,000 high-quality Classical Arabic to English translation samples.
arXiv Detail & Related papers (2024-07-29T09:45:34Z) - A multi-level multi-label text classification dataset of 19th century Ottoman and Russian literary and critical texts [8.405938712823563]
This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents.
The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian.
It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era.
arXiv Detail & Related papers (2024-07-21T12:14:45Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - Huruf: An Application for Arabic Handwritten Character Recognition Using
Deep Learning [0.0]
We propose a lightweight Convolutional Neural Network-based architecture for recognizing Arabic characters and digits.
The proposed pipeline consists of a total of 18 layers containing four layers each for convolution, pooling, batch normalization, dropout, and finally one Global average layer.
The proposed model respectively achieved an accuracy of 96.93% and 99.35% which is comparable to the state-of-the-art and makes it a suitable solution for real-life end-level applications.
arXiv Detail & Related papers (2022-12-16T17:39:32Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and
Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological)
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - TArC: Incrementally and Semi-Automatically Collecting a Tunisian Arabish
Corpus [3.8580784887142774]
This article describes the constitution process of the first morpho-syntactically annotated Tunisian Arabish Corpus (TArC)
Arabish, also known as Arabizi, is a spontaneous coding of Arabic dialects in Latin characters and arithmographs (numbers used as letters)
arXiv Detail & Related papers (2020-03-20T22:29:42Z) - Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
The research in the classification of morphologically rich and low resource Hindi language written in Devanagari script has been limited due to the absence of large labeled corpus.
In this work, we used translated versions of English data-sets to evaluate models based on CNN, LSTM and Attention.
The paper also serves as a tutorial for popular text classification techniques.
arXiv Detail & Related papers (2020-01-19T09:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.