PaddleOCR 3.0 Technical Report
- URL: http://arxiv.org/abs/2507.05595v1
- Date: Tue, 08 Jul 2025 02:14:10 GMT
- Title: PaddleOCR 3.0 Technical Report
- Authors: Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
- Abstract summary: PaddleOCR 3.0 is an Apache-licensed open-source toolkit for OCR and document parsing. Its models, each with fewer than 100 million parameters, achieve accuracy and efficiency competitive with mainstream vision-language models (VLMs).
- Score: 21.810256827625217
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This technical report introduces PaddleOCR 3.0, an Apache-licensed open-source toolkit for OCR and document parsing. To address the growing demand for document understanding in the era of large language models, PaddleOCR 3.0 presents three major solutions: (1) PP-OCRv5 for multilingual text recognition, (2) PP-StructureV3 for hierarchical document parsing, and (3) PP-ChatOCRv4 for key information extraction. Despite having fewer than 100 million parameters each, these models achieve accuracy and efficiency competitive with mainstream billion-parameter vision-language models (VLMs). In addition to offering a high-quality OCR model library, PaddleOCR 3.0 provides efficient tools for training, inference, and deployment, supports heterogeneous hardware acceleration, and enables developers to easily build intelligent document applications.
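For orientation, a minimal usage sketch of the paddleocr Python package is shown below. It assumes the long-standing pre-3.0 API (the PaddleOCR class and its ocr() method) and a placeholder image path; the PaddleOCR 3.0 pipelines named above (PP-OCRv5, PP-StructureV3, PP-ChatOCRv4) may expose additional or renamed entry points.

```python
# Minimal OCR sketch using the classic paddleocr Python API
# (pip install paddlepaddle paddleocr). This is an illustrative sketch,
# not PaddleOCR 3.0's exact interface.
from paddleocr import PaddleOCR

# Initialize detection + recognition; lang="en" selects English models,
# use_angle_cls=True enables text-orientation classification.
ocr = PaddleOCR(use_angle_cls=True, lang="en")

# "sample.jpg" is a placeholder path to a document image.
results = ocr.ocr("sample.jpg", cls=True)

# Each page yields (bounding_box, (text, confidence)) pairs, one per text line.
for page in results:
    for box, (text, score) in page:
        print(f"{score:.3f}\t{text}")
```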
Related papers
- MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm [60.14048367611333]
MonkeyOCR is a vision-language model for document parsing. It advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
arXiv Detail & Related papers (2025-06-05T16:34:57Z)
- PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language [2.1540520105079697]
We develop a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models. Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out.
arXiv Detail & Related papers (2025-05-15T07:58:38Z)
- VISTA-OCR: Towards generative and interactive end to end OCR models [3.7548609506798494]
VISTA-OCR is a lightweight architecture that unifies text detection and recognition within a single generative model. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples.
arXiv Detail & Related papers (2025-04-04T17:39:53Z)
- Efficient Multi-Instance Generation with Janus-Pro-Dirven Prompt Parsing [53.295515505026096]
Janus-Pro-driven Prompt Parsing is a prompt-parsing module that bridges text understanding and layout generation. MIGLoRA is a parameter-efficient plug-in integrating Low-Rank Adaptation into UNet (SD1.5) and DiT (SD3) backbones. The proposed method achieves state-of-the-art performance on COCO and LVIS benchmarks while maintaining parameter efficiency.
arXiv Detail & Related papers (2025-03-27T00:59:14Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR, and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge [1.8434042562191815]
EffOCR is a novel open-source optical character recognition (OCR) package.
It meets both the computational and sample efficiency requirements for liberating texts at scale.
EffOCR is cheap and sample-efficient to train, as the model only needs to learn characters' visual appearance and not how they are used in sequence to form language.
arXiv Detail & Related papers (2023-10-16T04:20:16Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System [11.622321298214043]
PP-OCRv3 upgrades the text detection and text recognition models of PP-OCRv2 in 9 aspects.
Experiments on real data show that the hmean of PP-OCRv3 is 5% higher than that of PP-OCRv2 at comparable inference speed; the hmean metric is sketched in the note below.
arXiv Detail & Related papers (2022-06-07T04:33:50Z)
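A brief note on the hmean metric cited in the PP-OCRv3 entry above: in PP-OCR detection evaluation it is, to our understanding, the harmonic mean (F1) of precision and recall. A minimal sketch with made-up illustrative numbers (not figures from the paper):

```python
# Sketch of hmean: the harmonic mean (F1) of detection precision and recall.
# The 0.92 / 0.85 inputs are illustrative placeholders, not paper results.
def hmean(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(hmean(0.92, 0.85), 3))  # -> 0.884
```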