QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation
- URL: http://arxiv.org/abs/2506.02295v1
- Date: Mon, 02 Jun 2025 22:21:06 GMT
- Title: QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation
- Authors: Ahmed Wasfy, Omer Nacar, Abdelakreem Elkhateb, Mahmoud Reda, Omar Elshehy, Adel Ammar, Wadii Boulila
- Abstract summary: We present Qari-OCR, a series of vision-language models progressively optimized for Arabic. Qari-OCR establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts.
- Score: 0.8944616102795021
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The inherent complexities of Arabic script: its cursive nature, diacritical marks (tashkeel), and varied typography, pose persistent challenges for Optical Character Recognition (OCR). We present Qari-OCR, a series of vision-language models derived from Qwen2-VL-2B-Instruct, progressively optimized for Arabic through iterative fine-tuning on specialized synthetic datasets. Our leading model, QARI v0.2, establishes a new open-source state-of-the-art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically-rich texts. Qari-OCR demonstrates superior handling of tashkeel, diverse fonts, and document layouts, alongside impressive performance on low-resolution images. Further explorations (QARI v0.3) showcase strong potential for structural document understanding and handwritten text. This work delivers a marked improvement in Arabic OCR accuracy and efficiency, with all models and datasets released to foster further research.
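The headline metrics above (WER and CER) are both edit-distance ratios: the Levenshtein distance between the recognized text and the reference, normalized by reference length, computed over words for WER and over characters for CER. A minimal sketch of how they are typically computed (this is a generic illustration, not the paper's own evaluation code):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (characters or word lists)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,       # deletion
                        dp[j - 1] + 1,   # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(reference, hypothesis):
    """Character Error Rate: char-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)
```

So a CER of 0.061 means roughly 6 character-level edits per 100 reference characters; diacritics make this metric especially sensitive for Arabic, since each dropped tashkeel mark counts as an error.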
Related papers
- SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition [0.995313069446686]
SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training.
arXiv Detail & Related papers (2025-05-30T13:47:54Z) - PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language [2.1540520105079697]
We develop a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models. Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out.
arXiv Detail & Related papers (2025-05-15T07:58:38Z) - LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis [56.00885545573299]
We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on Deepseek-R1. We develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina.
arXiv Detail & Related papers (2025-03-27T17:56:15Z) - KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding [24.9462694200992]
KITAB-Bench is a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Modern vision-language models (such as GPT-4, Gemini, and Qwen) outperform traditional OCR approaches by an average of 60% in Character Error Rate (CER). This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods.
arXiv Detail & Related papers (2025-02-20T18:41:23Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Arabic Handwritten Document OCR Solution with Binarization and Adaptive Scale Fusion Detection [1.1655046053160683]
We present a complete OCR pipeline that starts with line segmentation and Adaptive Scale Fusion techniques to ensure accurate detection of text lines. Our system, trained on the Arabic Multi-Fonts dataset, achieves a Character Recognition Rate (CRR) of 99.20% and a Word Recognition Rate (WRR) of 93.75% on single-word samples containing 7 to 10 characters.
arXiv Detail & Related papers (2024-12-02T15:21:09Z) - Qalam : A Multimodal LLM for Arabic Optical Character and Handwriting Recognition [18.280762424107408]
This study introduces Qalam, a novel foundation model designed for Arabic OCR and HWR.
Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks.
arXiv Detail & Related papers (2024-07-18T14:31:09Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - Strategies for improving low resource speech to text translation relying on pre-trained ASR models [59.90106959717875]
This paper presents techniques and findings for improving the performance of low-resource speech to text translation (ST).
We conducted experiments on both simulated and real-low resource setups, on language pairs English - Portuguese, and Tamasheq - French respectively.
arXiv Detail & Related papers (2023-05-31T21:58:07Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.