SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition
- URL: http://arxiv.org/abs/2505.24600v1
- Date: Fri, 30 May 2025 13:47:54 GMT
- Title: SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition
- Authors: Omer Nacar, Yasser Al-Habashi, Serry Sibaee, Adel Ammar, Wadii Boulila
- Abstract summary: SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training.
- Score: 0.995313069446686
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Arabic Optical Character Recognition (OCR) is essential for converting vast amounts of Arabic print media into digital formats. However, training modern OCR models, especially powerful vision-language models, is hampered by the lack of large, diverse, and well-structured datasets that mimic real-world book layouts. Existing Arabic OCR datasets often focus on isolated words or lines or are limited in scale, typographic variety, or structural complexity found in books. To address this significant gap, we introduce SARD (Large-Scale Synthetic Arabic OCR Dataset). SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training. Its synthetic nature provides unparalleled scalability and allows for precise control over layout and content variation. We detail the dataset's composition and generation process and provide benchmark results for several OCR models, including traditional and deep learning approaches, highlighting the challenges and opportunities presented by this dataset. SARD serves as a valuable resource for developing and evaluating robust OCR and vision-language models capable of processing diverse Arabic book-style texts.
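Although the generation code itself is not reproduced here, the core idea of rendering clean, book-style Arabic pages can be sketched in a few lines. The snippet below is a minimal illustration, assuming Pillow plus the arabic_reshaper and python-bidi packages for correct letter joining and right-to-left ordering; the font path, page size, and margins are illustrative assumptions, not SARD's actual parameters.
```python
# Minimal sketch of noise-free, book-style Arabic page rendering.
# Font path and layout values are invented for illustration.
from PIL import Image, ImageDraw, ImageFont
import arabic_reshaper                    # joins Arabic letters into shaped forms
from bidi.algorithm import get_display    # reorders text for right-to-left display

def render_page(text, font_path, page_size=(1240, 1754), margin=120, line_height=56):
    """Render plain Arabic text as a clean page image, one line per text line."""
    page = Image.new("RGB", page_size, "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.truetype(font_path, 36)
    y = margin
    for line in text.splitlines():
        shaped = get_display(arabic_reshaper.reshape(line))
        width = draw.textlength(shaped, font=font)
        # right-align each line, as in book-style Arabic layout
        draw.text((page_size[0] - margin - width, y), shaped, font=font, fill="black")
        y += line_height
        if y > page_size[1] - margin:
            break
    return page

page = render_page("مرحبا بالعالم", "fonts/Amiri-Regular.ttf")  # assumed font file
page.save("sample_page.png")
```
Sweeping the font file across multiple typefaces and varying layout parameters is what gives a synthetic corpus like SARD its typographic breadth without introducing scanner noise.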
Related papers
- QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation [0.8944616102795021]
We present Qari-OCR, a vision-language model progressively optimized for Arabic. Qari-OCR establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts.
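For readers unfamiliar with these metrics, WER and CER are both normalized edit distances, computed over words and characters respectively. A minimal, dependency-free sketch (not the paper's evaluation code; the sample strings are invented):
```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[-1] + 1,             # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(cer("kitab", "kitap"))                          # 0.2: one substitution in five characters
print(wer("kitab maftuh", "kitab maftuh tamaman"))    # 0.5: one inserted word over two reference words
```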
arXiv Detail & Related papers (2025-06-02T22:21:06Z)
- PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language [2.1540520105079697]
We develop a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models. Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out.
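Word-, line-, and document-level boxes typically nest, so a single record can carry all three granularities. A hypothetical annotation entry might look like the following (field names, text, and coordinates are illustrative, not PsOCR's published schema):
```python
# Hypothetical nested annotation record for one synthetic page.
annotation = {
    "image": "page_000001.png",
    "document_box": [0, 0, 1024, 1448],        # whole-page bounding box
    "lines": [
        {
            "box": [80, 120, 944, 168],         # [x1, y1, x2, y2] of the line
            "words": [
                {"text": "پښتو", "box": [880, 120, 944, 168]},
                {"text": "ليک",  "box": [800, 120, 872, 168]},
            ],
        },
    ],
}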
arXiv Detail & Related papers (2025-05-15T07:58:38Z)
- Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
- KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding [24.9462694200992]
KITAB-Bench is a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches by an average of 60% in Character Error Rate (CER). This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods.
arXiv Detail & Related papers (2025-02-20T18:41:23Z)
- Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts [0.0]
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges.
arXiv Detail & Related papers (2024-12-20T18:05:22Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA), which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines [1.174020933567308]
Kurdish libraries have many historical publications that were printed back in the early days when printing devices were brought to Kurdistan.
Current Optical Character Recognition (OCR) systems are unable to extract text from historical documents, as these documents suffer from many issues. In this study, we adopt an open-source OCR framework by Google, Tesseract version 5.0, which has been used to extract text for various languages.
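As a point of reference, invoking Tesseract 5 from Python takes only a few lines. The sketch below uses pytesseract with an assumed Central Kurdish (ckb) traineddata file and an invented input path; it is not the augmentation pipeline the paper builds.
```python
from PIL import Image
import pytesseract  # thin wrapper around the tesseract CLI

# Assumes a Tesseract 5.x install with ckb.traineddata available.
img = Image.open("old_publication_page.png")        # invented path
text = pytesseract.image_to_string(img, lang="ckb")
print(text)
```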
arXiv Detail & Related papers (2024-04-09T08:08:03Z)
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training [119.03392147066093]
Recent autoregressive vision-language models have excelled in few-shot text generation tasks but face challenges in alignment tasks.
We introduce the contrastive loss into text generation models, partitioning the language model into dedicated unimodal text-processing and multimodal data-handling components.
To bridge this gap, this work introduces VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions.
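The key move, adding a contrastive alignment term alongside the usual autoregressive objective, can be sketched as a weighted sum of the two losses. The PyTorch snippet below is a generic illustration under assumed tensor shapes, not COSMO's actual implementation; alpha and the temperature are invented hyperparameters.
```python
import torch
import torch.nn.functional as F

def joint_loss(img_emb, txt_emb, lm_logits, lm_targets, alpha=0.5, temp=0.07):
    """Weighted sum of a CLIP-style contrastive loss and a language-modeling loss.

    img_emb, txt_emb: (batch, dim) pooled embeddings of paired inputs.
    lm_logits: (batch, seq_len, vocab); lm_targets: (batch, seq_len).
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t() / temp                      # pairwise similarities
    labels = torch.arange(sims.size(0), device=sims.device)  # matched pairs on the diagonal
    contrastive = (F.cross_entropy(sims, labels)             # image -> text
                   + F.cross_entropy(sims.t(), labels)) / 2  # text -> image
    lm = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    return alpha * contrastive + (1 - alpha) * lm
```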
arXiv Detail & Related papers (2024-01-01T18:58:42Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT-4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to separate and leverage semantics and descriptive style by incorporating a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.