Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
- URL: http://arxiv.org/abs/2511.04699v1
- Date: Sat, 01 Nov 2025 04:54:58 GMT
- Title: Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
- Authors: Haneen Al-Homoud, Asma Ibrahim, Murtadha Al-Jubran, Fahad Al-Otaibi, Yazeed Al-Harbi, Daulet Toibazar, Kesen Wang, Pedro J. Moreno,
- Abstract summary: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU)<n>The dataset comprises over 2.5 million samples, including 1.5 million textual data, 270K fully annotated tables, and hundred thousands of real data based charts.
- Score: 3.587092806938212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million of samples, including 1.5 million textual data, 270K fully annotated tables, and hundred thousands of real data based charts. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) in terms of OCR across multiple public Arabic benchmarks, Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharTeX) improved as well in other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.
Related papers
- synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier [0.0]
We present SynthOCR-Gen, an open-source synthetic OCR dataset generator specifically designed for low-resource languages.<n>Our tool addresses the fundamental bottleneck in OCR development by transforming digital Unicode text corpora into ready-to-use training datasets.<n>We demonstrate the efficacy of our approach by generating a 600,000-sample word-segmented Kashmiri OCR dataset.
arXiv Detail & Related papers (2026-01-22T17:01:33Z) - UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters [55.34921520578968]
Vision-language models (VLMs) have achieved impressive unified recognition of text and formulas.<n>We propose UniRec-0.1B, a unified recognition model with only 0.1B parameters.<n>It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents.
arXiv Detail & Related papers (2025-12-24T10:35:21Z) - MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition.<n>To address complex table structures, we propose a visual consistency-based reinforcement learning scheme.<n>Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z) - QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation [0.8944616102795021]
We present Qari-OCR, a vision-language models progressively optimized for Arabic.<n>Qari-OCR establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts.
arXiv Detail & Related papers (2025-06-02T22:21:06Z) - SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition [0.995313069446686]
SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents.<n>It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage.<n>Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training.
arXiv Detail & Related papers (2025-05-30T13:47:54Z) - Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data.<n>It can generate high-quality instruction-tuning data.<n>It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z) - Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts [0.0]
This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik.<n>Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges.
arXiv Detail & Related papers (2024-12-20T18:05:22Z) - Discourse Features Enhance Detection of Document-Level Machine-Generated Content [53.41994768824785]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.<n>Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features.<n>We introduce novel methodologies and datasets to overcome these challenges.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding [23.910783272007407]
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU)
Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset.
Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
arXiv Detail & Related papers (2024-08-27T03:31:24Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - WordScape: a Pipeline to extract multilingual, visually rich Documents
with Layout Annotations from Web Crawl Data [13.297444760076406]
We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora.
WordScape parses the Open XML structure of Word documents obtained from the web.
It offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text.
arXiv Detail & Related papers (2023-12-15T20:28:31Z) - mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document
Understanding [55.4806974284156]
Document understanding refers to automatically extract, analyze and comprehend information from digital documents, such as a web page.
Existing Multi-model Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z) - Generating More Pertinent Captions by Leveraging Semantics and Style on
Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.