Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script
- URL: http://arxiv.org/abs/2503.15023v1
- Date: Wed, 19 Mar 2025 09:20:42 GMT
- Title: Bridging the Gap: Fusing CNNs and Transformers to Decode the Elegance of Handwritten Arabic Script
- Authors: Chaouki Boufenar, Mehdi Ayoub Rabiai, Boualem Nadjib Zahaf, Khelil Rafik Ouaras,
- Abstract summary: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handwritten Arabic script recognition is a challenging task due to the script's dynamic letter forms and contextual variations. This paper proposes a hybrid approach combining convolutional neural networks (CNNs) and Transformer-based architectures to address these complexities. We evaluated custom and fine-tuned models, including EfficientNet-B7 and Vision Transformer (ViT-B16), and introduced an ensemble model that leverages confidence-based fusion to integrate their strengths. Our ensemble achieves remarkable performance on the IFN/ENIT dataset, with 96.38% accuracy for letter classification and 97.22% for positional classification. The results highlight the complementary nature of CNNs and Transformers, demonstrating their combined potential for robust Arabic handwriting recognition. This work advances OCR systems, offering a scalable solution for real-world applications.
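The abstract does not spell out the fusion rule, so the following is only a minimal sketch of one common form of confidence-based fusion, assuming each backbone (e.g., an EfficientNet-B7 head and a ViT-B16 head trained on the same label set) outputs class logits and the ensemble trusts whichever model reports the higher softmax confidence per sample:

```python
import torch
import torch.nn.functional as F

def confidence_fusion(cnn_logits: torch.Tensor, vit_logits: torch.Tensor) -> torch.Tensor:
    """Fuse two classifiers by trusting the more confident one per sample.

    cnn_logits, vit_logits: (batch, num_classes) raw outputs of the two
    backbones. Returns fused class predictions of shape (batch,).
    """
    cnn_probs = F.softmax(cnn_logits, dim=-1)
    vit_probs = F.softmax(vit_logits, dim=-1)

    # Per-sample confidence = maximum softmax probability of each model.
    cnn_conf, cnn_pred = cnn_probs.max(dim=-1)
    vit_conf, vit_pred = vit_probs.max(dim=-1)

    # Keep the CNN prediction where it is more confident, otherwise the ViT's.
    return torch.where(cnn_conf >= vit_conf, cnn_pred, vit_pred)
```

A weighted average of the two probability vectors, with weights derived from the same per-sample confidences, is an equally plausible variant of this scheme.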
Related papers
- Arabic Handwritten Document OCR Solution with Binarization and Adaptive Scale Fusion Detection [1.1655046053160683]
We present a complete OCR pipeline that starts with line segmentation and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Our system, trained on the Arabic Multi-Fonts dataset, achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters.
arXiv Detail & Related papers (2024-12-02T15:21:09Z) - Advanced Arabic Alphabet Sign Language Recognition Using Transfer Learning and Transformer Models [0.0]
This paper presents an Arabic Alphabet Sign Language recognition approach, using deep learning methods in conjunction with transfer learning and transformer-based models.
We study the performance of the different variants on two publicly available datasets, namely ArSL2018 and AASL.
Experimental results show that the suggested methodology achieves high recognition accuracy, reaching up to 99.6% and 99.43% on ArSL2018 and AASL, respectively.
arXiv Detail & Related papers (2024-10-01T13:39:26Z) - Classification of Non-native Handwritten Characters Using Convolutional Neural Network [0.0]
The classification of English characters written by non-native users is performed with a proposed custom-tailored CNN model.
We train this CNN with a new dataset called the handwritten isolated English character dataset.
The proposed model with five convolutional layers and one hidden layer outperforms state-of-the-art models in terms of character recognition accuracy.
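The summary pins down only the depth (five convolutional layers plus one hidden fully connected layer); the PyTorch sketch below is a hypothetical rendering under assumed channel widths, 3x3 kernels, and a 32x32 grayscale input, not the paper's reported configuration:

```python
import torch.nn as nn

class CharCNN(nn.Module):
    """Illustrative CNN with five convolutional layers and one hidden
    fully connected layer; channel widths, kernel sizes, and the 32x32
    grayscale input are assumptions, not the paper's exact setup."""

    def __init__(self, num_classes: int = 26):
        super().__init__()
        chans = [1, 32, 64, 128, 128, 256]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]            # halve spatial size each block
        self.features = nn.Sequential(*blocks)      # 32x32 -> 1x1 after 5 pools
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 128),                    # the single hidden layer
            nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```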
arXiv Detail & Related papers (2024-06-06T21:08:07Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - An Ensemble Method Based on the Combination of Transformers with Convolutional Neural Networks to Detect Artificially Generated Text [0.0]
We present classification models constructed by ensembling transformer models such as Sci-BERT, DeBERTa, and XLNet with Convolutional Neural Networks (CNNs).
Our experiments demonstrate that the considered ensemble architectures surpass the performance of the individual transformer models for classification.
arXiv Detail & Related papers (2023-10-26T11:17:03Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - A Transformer-based Approach for Arabic Offline Handwritten Text Recognition [0.0]
We introduce two alternative architectures for recognizing offline Arabic handwritten text.
Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex.
Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches.
arXiv Detail & Related papers (2023-07-27T17:51:52Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
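As a quick illustration of the interpolation described above (which mixup-transformer applies to hidden representations rather than raw inputs), here is a minimal sketch assuming one-hot labels and a Beta(alpha, alpha) mixing coefficient:

```python
import torch

def mixup(x: torch.Tensor, y_onehot: torch.Tensor, alpha: float = 0.2):
    """Linearly interpolate a batch with a shuffled copy of itself.

    x: inputs (or, for mixup-transformer, hidden representations);
    y_onehot: one-hot or soft labels of shape (batch, num_classes).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient
    perm = torch.randperm(x.size(0))                        # random pairing
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix
```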
arXiv Detail & Related papers (2020-10-05T23:37:30Z) - Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
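For readers unfamiliar with the design, a simplified sketch of a Conformer-style block follows: two half-step feed-forward modules sandwich multi-head self-attention and a depthwise convolution module. Model dimension, head count, and kernel size here are illustrative assumptions, and details such as relative positional encoding and dropout are omitted:

```python
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Simplified Conformer-style block; not the paper's exact module."""

    def __init__(self, dim: int = 256, heads: int = 4, kernel: int = 31):
        super().__init__()
        self.ff1 = self._feed_forward(dim)
        self.ff2 = self._feed_forward(dim)
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Sequential(                  # operates on (batch, dim, time)
            nn.Conv1d(dim, 2 * dim, kernel_size=1),
            nn.GLU(dim=1),                          # gated pointwise projection
            nn.Conv1d(dim, dim, kernel_size=kernel,
                      padding=kernel // 2, groups=dim),  # depthwise conv
            nn.BatchNorm1d(dim),
            nn.SiLU(),
            nn.Conv1d(dim, dim, kernel_size=1),
        )
        self.final_norm = nn.LayerNorm(dim)

    @staticmethod
    def _feed_forward(dim):
        return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                             nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                           # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)       # -> (batch, dim, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)
```

The convolution branch captures local acoustic patterns while self-attention models global context, the same complementary pairing the main abstract highlights for handwriting recognition.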
arXiv Detail & Related papers (2020-05-16T20:56:25Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables when paired with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)