Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
- URL: http://arxiv.org/abs/2509.18174v1
- Date: Wed, 17 Sep 2025 15:07:29 GMT
- Title: Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
- Authors: Khalil Hennara, Muhammad Hreden, Mohamed Motasim Hamed, Ahmad Bastati, Zeina Aldallal, Sara Chrouf, Safwan AlModhayan
- Abstract summary: We introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Arabic document OCR remains a challenging task due to the language's cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine-tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
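The WER and CER figures quoted in this abstract and the related papers below are edit-distance-based metrics. As a rough illustration only (this is not the authors' evaluation code, and benchmark implementations may differ in tokenization and normalization), they can be computed from the Levenshtein distance between reference and hypothesis sequences:

```python
def levenshtein(ref, hyp):
    """Edit distance between two token sequences (dynamic programming, two rows)."""
    n = len(hyp)
    prev = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            curr[j] = min(prev[j] + 1,                                  # deletion
                          curr[j - 1] + 1,                              # insertion
                          prev[j - 1] + (ref[i - 1] != hyp[j - 1]))     # substitution/match
        prev = curr
    return prev[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / max(len(reference), 1)
```

For example, a hypothesis that gets one word wrong out of four reference words yields a WER of 0.25; real Arabic OCR evaluation would additionally need consistent Unicode normalization of diacritics before scoring.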
Related papers
- Multimodal Evaluation of Russian-language Architectures [88.00147763684451]
We introduce Mera Multi, an open multimodal evaluation framework for Russian-spoken architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities. Mera Multi provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages.
arXiv Detail & Related papers (2025-11-19T15:43:53Z)
- Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks. They struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. We introduce a novel fine-tuning paradigm, Synchronously Self-Reviewing (SSR), in which the model reviews its own OCR proficiency, inspired by the concept of a "Bilingual Cognitive Advantage".
arXiv Detail & Related papers (2025-07-11T05:02:06Z)
- QARI-OCR: High-Fidelity Arabic Text Recognition through Multimodal Large Language Model Adaptation [0.8944616102795021]
We present Qari-OCR, a vision-language model progressively optimized for Arabic. Qari-OCR establishes a new open-source state of the art with a Word Error Rate (WER) of 0.160, Character Error Rate (CER) of 0.061, and BLEU score of 0.737 on diacritically rich texts.
arXiv Detail & Related papers (2025-06-02T22:21:06Z)
- SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition [0.995313069446686]
SARD is a massive, synthetically generated dataset specifically designed to simulate book-style documents. It comprises 843,622 document images containing 690 million words, rendered across ten distinct Arabic fonts to ensure broad typographic coverage. Unlike datasets derived from scanned documents, SARD is free from real-world noise and distortions, offering a clean and controlled environment for model training.
arXiv Detail & Related papers (2025-05-30T13:47:54Z)
- Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines [0.8944616102795021]
This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions.
arXiv Detail & Related papers (2025-04-30T09:56:36Z)
- KITAB-Bench: A Comprehensive Multi-Domain Benchmark for Arabic OCR and Document Understanding [24.9462694200992]
KITAB-Bench is a comprehensive Arabic OCR benchmark that fills the gaps in current evaluation systems. Modern vision-language models (such as GPT-4o, Gemini, and Qwen) outperform traditional OCR approaches by an average of 60% in Character Error Rate (CER). This work establishes a rigorous evaluation framework that can drive improvements in Arabic document analysis methods.
arXiv Detail & Related papers (2025-02-20T18:41:23Z)
- CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.