Fugu-MT Paper Translation (Abstract): MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

Paper overview: MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing

arxiv url: http://arxiv.org/abs/2605.24973v1
Date: Sun, 24 May 2026 10:00:28 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.554887
Title: MinerU-Popo: Universal Post-Processing Model for Structured Document Parsing
Title (reference translation): MinerU-Popo: A Universal Post-Processing Model for Structured Document Parsing
Authors: Bangrui Xu, Ziyang Miao, Xuanhe Zhou, Yiming Lin, Zirui Tang, Xiaomeng Zhao, Fan Wu, Cheng Tan, Fan Wu, Bin Wang, Conghui He
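Abstract summary: MinerU-Popo is a lightweight framework for post-processing OCR outputs. It decomposes the problem into four subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. It improves title-hierarchy TEDS by at least 20% across the five tested OCR models.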
Reference score (independently computed attention score): 34.19535115746437
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: VLM-based OCR models have become the de facto choice for document parsing, as they can accurately extract page-level elements (e.g., paragraphs within individual pages) together with their bounding boxes and textual content. However, downstream applications such as RAG require coherent document-level information, whereas these models often break cross-page continuity and fail to recover disrupted structures, such as paragraphs and tables truncated by page boundaries. Such relationships are not confined to a single page; instead, they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, we propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs, which converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four focused subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, we build a task-oriented data engine with task-specific input filtering, and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, we introduce dynamic chunking with overlap-based synchronization, which aligns chunk-level outputs from the fine-tuned model and preserves global consistency. Finally, we assemble the aligned outputs into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show MinerU-Popo improves title-hierarchy TEDS by at least 20% across all five tested OCR models, improves RAG accuracy and reduces per-query latency.
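Abstract (reference translation): VLM-based OCR models have become the de facto choice for document parsing because they can accurately extract page-level elements (for example, paragraphs within individual pages) together with their bounding boxes and text content. However, downstream applications such as RAG need coherent document-level information, while these models often break cross-page continuity and fail to recover disrupted structures such as paragraphs and tables truncated at page boundaries. These relationships are not confined to a single page; they require joint analysis of titles, paragraphs, tables, and images spanning multiple pages. A natural solution is therefore to reuse existing OCR outputs and reconstruct document-level logical structures through post-processing. To this end, the authors propose MinerU-Popo, a lightweight and universal framework for POst-Processing OCR outputs that converts page-level results from diverse parsers into coherent document-level structures. MinerU-Popo decomposes the problem into four subtasks: text truncation recovery, table truncation recovery, title hierarchy reconstruction, and image-text association. To address these effectively, the authors build a task-oriented data engine with task-specific input filtering and use the generated data (30K) to fine-tune a lightweight post-processing model (Qwen3-VL-4B). To support long documents, they introduce dynamic chunking with overlap-based synchronization, which aligns the chunk-level outputs of the fine-tuned model and preserves global consistency. Finally, the aligned outputs are assembled into a tree-structured document representation, further enriched with node chunking and summaries for downstream retrieval and analysis. Empirical results show that MinerU-Popo improves title-hierarchy TEDS by at least 20% across the five tested OCR models, improves RAG accuracy, and reduces per-query latency.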
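
Illustrative sketch (not from the paper): the abstract names dynamic chunking with overlap-based synchronization but does not describe the procedure. The Python below shows one way such a scheme could work and should not be read as the authors' implementation. It assumes fixed-size windows (the paper's chunking is dynamic), and it assumes that each chunk yields heading levels that are correct only up to a constant shift, so the first shared element is used to align a chunk with what has already been merged. The names chunk_with_overlap and synchronize_levels are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chunk_with_overlap(
    items: Sequence[T], size: int, overlap: int
) -> List[Tuple[int, List[T]]]:
    """Split items into windows of `size` elements.

    Each window after the first repeats the last `overlap` elements of the
    previous window. Returns (start_index, window) pairs.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[Tuple[int, List[T]]] = []
    start = 0
    while start < len(items):
        chunks.append((start, list(items[start:start + size])))
        if start + size >= len(items):
            break  # this window already reaches the end of the input
        start += step
    return chunks


def synchronize_levels(chunk_levels: Sequence[Tuple[int, Sequence[int]]]) -> List[int]:
    """Merge per-chunk heading levels into one global list.

    Assumption (hypothetical): levels inside a chunk are right relative to
    each other but may be shifted by a constant. The shift is estimated from
    the first element the chunk shares with the already merged prefix.
    """
    merged: List[int] = []
    for start, levels in chunk_levels:
        if not levels:
            continue
        if start > len(merged):
            raise ValueError("chunks must be contiguous or overlapping")
        # Offset that makes the first shared element agree with the merged value.
        offset = merged[start] - levels[0] if start < len(merged) else 0
        for i, level in enumerate(levels):
            if start + i >= len(merged):  # keep earlier decisions in the overlap
                merged.append(level + offset)
    return merged
```

In this sketch the start indices returned by chunk_with_overlap are the same ones synchronize_levels expects, so a caller would run the post-processing model on each window and pass the resulting (start, levels) pairs to the merge step. Keeping the earlier chunk's values inside the overlap is a design choice made here for simplicity; a voting or confidence-based rule would be an equally plausible alternative.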


Related papers