Fugu-MT Paper Translation (Abstract): MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Paper overview: MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

arxiv url: http://arxiv.org/abs/2511.10390v1
Date: Fri, 14 Nov 2025 01:48:44 GMT
Status: Translation complete
Last updated in system: 2025-11-14 22:53:22.84261
Title: MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Authors: Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
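Abstract summary: MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. To address complex table structures, a visual consistency-based reinforcement learning scheme is proposed that evaluates recognition quality via render-and-compare alignment. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables that cross pages or columns.
Reference score (independently calculated attention metric): 80.05126590825121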
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
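
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration of the data flow (layout and reading order first, then per-region recognition), not the MonkeyOCR implementation: `Region`, `parse_page`, `fake_layout`, and `fake_recognizers` are hypothetical names, and the stub functions stand in for the actual models.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Region:
    """A detected layout region (hypothetical structure, for illustration)."""
    kind: str                         # e.g. "text", "formula", or "table"
    bbox: Tuple[int, int, int, int]   # (x0, y0, x1, y1) in page coordinates
    order: int                        # position in the predicted reading order


def parse_page(
    page: Any,
    detect_layout: Callable[[Any], List[Region]],
    recognizers: Dict[str, Callable[[Any, Region], str]],
) -> List[str]:
    # Stage 1: predict layout regions together with their reading order.
    regions = detect_layout(page)

    # Stage 2: recognize each region locally, emitting results in reading order.
    outputs = []
    for region in sorted(regions, key=lambda r: r.order):
        recognize = recognizers[region.kind]
        outputs.append(recognize(page, region))
    return outputs


# Stub components standing in for the real models (assumptions, not real APIs).
def fake_layout(page: Any) -> List[Region]:
    return [
        Region("table", (0, 50, 100, 90), order=1),
        Region("text", (0, 0, 100, 40), order=0),
    ]


fake_recognizers = {
    "text": lambda page, r: f"text@{r.bbox}",
    "table": lambda page, r: f"<table>@{r.bbox}",
}

result = parse_page(None, fake_layout, fake_recognizers)
print(result)
```

Keeping recognition local to each detected region is what the abstract credits with reducing error propagation: in this sketch, a mistake in one recognizer affects only that region's output.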

Related papers