Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition
- URL: http://arxiv.org/abs/2406.09630v1
- Date: Thu, 13 Jun 2024 23:40:34 GMT
- Title: Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition
- Authors: Mehreen Saeed, Adrian Chan, Anupam Mijar, Joseph Moukarzel, Georges Habchi, Carlos Younes, Amin Elias, Chau-Wai Wong, Akram Khater,
- Abstract summary: The Manuscripts of Handwritten Arabic(Muharaf) dataset is a machine learning dataset consisting of more than 1,600 historic handwritten page images.
This dataset was compiled to advance the state of the art in handwritten text recognition.
- Score: 5.28595286827031
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present the Manuscripts of Handwritten Arabic~(Muharaf) dataset, which is a machine learning dataset consisting of more than 1,600 historic handwritten page images transcribed by experts in archival Arabic. Each document image is accompanied by spatial polygonal coordinates of its text lines as well as basic page elements. This dataset was compiled to advance the state of the art in handwritten text recognition (HTR), not only for Arabic manuscripts but also for cursive text in general. The Muharaf dataset includes diverse handwriting styles and a wide range of document types, including personal letters, diaries, notes, poems, church records, and legal correspondences. In this paper, we describe the data acquisition pipeline, notable dataset features, and statistics. We also provide a preliminary baseline result achieved by training convolutional neural networks using this data.
Related papers
- HATFormer: Historic Handwritten Arabic Text Recognition with Transformers [6.3660090769559945]
Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models.
We propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model.
Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges.
arXiv Detail & Related papers (2024-10-03T03:43:29Z) - AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing [82.33075210051129]
We introduce AceParse, the first comprehensive dataset designed to support the parsing of structured texts.
Based on AceParse, we fine-tuned a multimodal model, named Ace, which accurately parses various structured texts.
This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity.
arXiv Detail & Related papers (2024-09-16T06:06:34Z) - Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z) - Arabic Handwritten Text Line Dataset [0.0]
We present a new dataset specifically designed for historical Arabic script in which we annotate position in word level.
The problem of segmentation into text lines is solved since there are carefully annotated dataset dedicated to this task.
arXiv Detail & Related papers (2023-12-10T14:32:25Z) - Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z) - PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage.
Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors.
We propose a contrastively trained model fit to learn textbfauthorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z) - Digital Editions as Distant Supervision for Layout Analysis of Printed
Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z) - KOHTD: Kazakh Offline Handwritten Text Dataset [0.0]
We propose an extensive Kazakh offline Handwritten Text dataset (KOHTD)
KOHTD has 3000 handwritten exam papers and more than 140335 segmented images and there are approximately 922010 symbols.
We used a variety of popular text recognition methods for word and line recognition in our studies, including CTC-based and attention-based methods.
arXiv Detail & Related papers (2021-09-22T16:19:38Z) - One-shot Compositional Data Generation for Low Resource Handwritten Text
Recognition [10.473427493876422]
Low resource Handwritten Text Recognition is a hard problem due to the scarce annotated data and the very limited linguistic information.
In this paper we address this problem through a data generation technique based on Bayesian Program Learning.
Contrary to traditional generation approaches, which require a huge amount of annotated images, our method is able to generate human-like handwriting using only one sample of each symbol from the desired alphabet.
arXiv Detail & Related papers (2021-05-11T18:53:01Z) - Handwriting Classification for the Analysis of Art-Historical Documents [6.918282834668529]
We focus on the analysis of handwriting in scanned documents from the art-historic archive of the WPI.
We propose a handwriting classification model that labels extracted text fragments based on their visual structure.
arXiv Detail & Related papers (2020-11-04T13:06:46Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.