New Results for the Text Recognition of Arabic Maghribī Manuscripts
  -- Managing an Under-resourced Script
- URL: http://arxiv.org/abs/2211.16147v1
- Date: Tue, 29 Nov 2022 12:21:41 GMT
- Title: New Results for the Text Recognition of Arabic Maghribī Manuscripts
  -- Managing an Under-resourced Script
- Authors: Lucas Noëmie, Clément Salah (SU, UNIL), Chahan Vidal-Gorène
  (ENC)
- Abstract summary: We introduce and assess a new modus operandi for developing and fine-tuning HTR models dedicated to the Arabic Maghribī scripts.
The comparison between several state-of-the-art HTR models demonstrates the relevance of a word-based neural approach specialized for Arabic.
The results open new perspectives for processing Arabic scripts and, more generally, poorly endowed languages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: HTR model development has become a conventional step for digital humanities
projects. The performance of these models, often quite high, relies on manual
transcription and numerous handwritten documents. Although the method has
proven successful for Latin scripts, a similar amount of data is not yet
achievable for scripts considered poorly-endowed, like Arabic scripts. In that
respect, we introduce and assess a new modus operandi for HTR model
development and fine-tuning dedicated to the Arabic Maghribī scripts. The
comparison between several state-of-the-art HTR models demonstrates the relevance of a
word-based neural approach specialized for Arabic, capable of achieving an error
rate below 5% with only 10 pages manually transcribed. These results open new
perspectives for Arabic scripts processing and more generally for
poorly-endowed languages processing. This research is part of the development
of the RASAM dataset in partnership with the GIS MOMM and the BULAC.
 
      
Related papers
- The Role of Orthographic Consistency in Multilingual Embedding Models for Text Classification in Arabic-Script Languages [30.39307182175106]
 We introduce the Arabic Script RoBERTa (AS-RoBERTa) family: four RoBERTa-based models, each pre-trained on a large corpus tailored to its specific language.
Our results highlight the value of script-aware specialization for languages using the Arabic script and support further work on pre-training strategies rooted in script and language specificity.
 arXiv  Detail & Related papers  (2025-07-24T19:28:33Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
 Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
 arXiv  Detail & Related papers  (2024-12-17T08:47:41Z)
- Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
 This paper addresses the need for democratizing large language models (LLM) in the Arab world.
One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.
Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
 arXiv  Detail & Related papers  (2024-12-16T19:29:06Z)
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
 We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
 arXiv  Detail & Related papers  (2024-10-08T15:22:36Z)
- HATFormer: Historic Handwritten Arabic Text Recognition with Transformers [6.3660090769559945]
 Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models.
We propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model.
Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges.
 arXiv  Detail & Related papers  (2024-10-03T03:43:29Z)
- Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
 We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts.
We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both.
Our findings show that the effectiveness of transliteration varies by task type and model size.
 arXiv  Detail & Related papers  (2024-07-02T14:51:20Z)
- TransMI: A Framework to Create Strong Baselines from Multilingual Pretrained Language Models for Transliterated Data [50.40191599304911]
 We propose Transliterate-Merge-Initialize (TransMI), which can create a strong baseline well-suited for data that is transliterated into a common script.
Results show a consistent improvement of 3% to 34%, varying across different models and tasks.
 arXiv  Detail & Related papers  (2024-05-16T09:08:09Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
 The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
 arXiv  Detail & Related papers  (2023-09-21T13:20:13Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
 We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
 arXiv  Detail & Related papers  (2023-01-26T20:37:03Z)
- Huruf: An Application for Arabic Handwritten Character Recognition Using Deep Learning [0.0]
 We propose a lightweight Convolutional Neural Network-based architecture for recognizing Arabic characters and digits.
The proposed pipeline consists of a total of 18 layers containing four layers each for convolution, pooling, batch normalization, dropout, and finally one Global average layer.
The proposed model respectively achieved an accuracy of 96.93% and 99.35% which is comparable to the state-of-the-art and makes it a suitable solution for real-life end-level applications.
 arXiv  Detail & Related papers  (2022-12-16T17:39:32Z)
- Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models [0.0]
 We have tried to describe various approaches and achievements of recent years in the development of handwritten recognition models.
The first model uses deep convolutional neural networks (CNNs) for feature extraction and a fully connected multilayer perceptron neural network (MLP) for word classification.
The second model, called SimpleHTR, uses CNN and recurrent neural network (RNN) layers to extract information from images.
 arXiv  Detail & Related papers  (2021-02-09T13:34:16Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
 Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
 arXiv  Detail & Related papers  (2020-12-31T11:37:28Z)
- A Hybrid Deep Learning Model for Arabic Text Recognition [2.064612766965483]
 This paper presents a model that can recognize Arabic text that was printed using multiple font types.
The proposed model employs a hybrid DL network that can recognize Arabic printed text without the need for character segmentation.
The model achieved good results in recognizing characters and words and it also achieved promising results in recognizing characters when it was tested on unseen data.
 arXiv  Detail & Related papers  (2020-09-04T02:49:17Z)
- Deep Learning for Hindi Text Classification: A Comparison [6.8629257716723]
 The research in the classification of morphologically rich and low resource Hindi language written in Devanagari script has been limited due to the absence of large labeled corpus.
In this work, we used translated versions of English data-sets to evaluate models based on CNN, LSTM and Attention.
The paper also serves as a tutorial for popular text classification techniques.
 arXiv  Detail & Related papers  (2020-01-19T09:29:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     