Fugu-MT Paper Translation (Abstract): Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Paper overview: Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

arxiv url: http://arxiv.org/abs/2603.23885v1
Date: Wed, 25 Mar 2026 03:19:09 GMT
Status: Translation complete
System update date: 2026-03-26 21:06:11.106261
Title: Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training
Title (reference translation): Towards real-world document parsing through realistic scene synthesis and document-aware training
Authors: Gengluo Li, Chengquan Zhang, Yupu Liang, Huawen Shen, Yaping Zhang, Pengyuan Lyu, Weinong Wang, Xingyu Wan, Gangyan Zeng, Han Hu, Can Ma, Yu Zhou
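Abstract summary: This paper proposes a data-training co-design framework for robust end-to-end document parsing. The method improves accuracy and robustness in both scanned/digital and real-world captured scenarios.
Reference score (independently computed attention score): 29.093072408848467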
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions - primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.
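Abstract (reference translation): Document parsing has recently advanced through multimodal large language models (MLLMs) that map document images directly to structured outputs. Traditional cascaded pipelines depend on accurate layout analysis and often fail on casually captured documents or under non-standard conditions. End-to-end approaches reduce this dependency, but they still produce repetitive, hallucinated, and structurally inconsistent predictions, mainly because large-scale, high-quality full-page (document-level) end-to-end parsing data is scarce and structure-aware training strategies are lacking. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy builds large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to improve structural fidelity and decoding stability. We also build Wild-OmniDocBench, a benchmark derived from real-world captured documents, for robustness evaluation. Integrated into a 1B-parameter MLLM, the method achieves superior accuracy and robustness in both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.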

Related papers