Fugu-MT Paper Translation (Abstract): PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

Paper overview: PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks

arxiv url: http://arxiv.org/abs/2603.24373v1
Date: Wed, 25 Mar 2026 14:54:40 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.341537
Title: PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks
Authors: Cheng Cui, Yubo Zhang, Ting Sun, Xueqing Wang, Hongen Liu, Manhui Lin, Yue Zhang, Tingquan Gao, Changda Zhou, Jiaxuan Liu, Zelun Zhang, Jing Zhang, Jun Zhang, Yi Liu
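Abstract summary: "OCR 2.0" and large-scale vision-language models (VLMs) have set new benchmarks in text recognition. PP-OCRv5 is a highly optimized, lightweight OCR system with only 5 million parameters.
Reference score (attention score computed independently by the site): 21.41974664575541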
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The advent of "OCR 2.0" and large-scale vision-language models (VLMs) has set new benchmarks in text recognition. However, these unified architectures often come with significant computational demands, challenges in precise text localization within complex layouts, and a propensity for textual hallucinations. Revisiting the prevailing notion that model scale is the sole path to high accuracy, this paper introduces PP-OCRv5, a meticulously optimized, lightweight OCR system with merely 5 million parameters. We demonstrate that PP-OCRv5 achieves performance competitive with many billion-parameter VLMs on standard OCR benchmarks, while offering superior localization precision and reduced hallucinations. The cornerstone of our success lies not in architectural expansion but in a data-centric investigation. We systematically dissect the role of training data by quantifying three critical dimensions: data difficulty, data accuracy, and data diversity. Our extensive experiments reveal that with a sufficient volume of high-quality, accurately labeled, and diverse data, the performance ceiling for traditional, efficient two-stage OCR pipelines is far higher than commonly assumed. This work provides compelling evidence for the viability of lightweight, specialized models in the large-model era and offers practical insights into data curation for OCR. The source code and models are publicly available at https://github.com/PaddlePaddle/PaddleOCR.

Related papers