Fugu-MT Paper Translation (Abstract): Multimodal OCR: Parse Anything from Documents

Paper overview: Multimodal OCR: Parse Anything from Documents

arxiv url: http://arxiv.org/abs/2603.13032v1
Date: Fri, 13 Mar 2026 14:42:21 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:12.128182
Title: Multimodal OCR: Parse Anything from Documents
Title (reference translation): Multimodal OCR: Parse Anything from Documents
Authors: Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai
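Abstract summary: dots.mocr treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets. It reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction. It supports end-to-end training over heterogeneous document elements.
Reference score (independently computed attention score): 72.18225200292527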
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method, termed dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages: (1) it reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) it supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) it converts previously discarded graphics into reusable code-level supervision, unlocking multimodal supervision embedded in existing documents. To make this paradigm practical at scale, we build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. We evaluate dots.mocr from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on our OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, demonstrating strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.
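Abstract (reference translation): We propose Multimodal OCR (MOCR), a document parsing paradigm that parses text and graphics into a unified textual representation. Unlike conventional OCR systems that focus on text recognition, the method, called dots.mocr, treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, so documents can be parsed while preserving the semantic relationships between elements. It (1) reconstructs both text and graphics as structured outputs, enabling more faithful document reconstruction; (2) supports end-to-end training over heterogeneous document elements, allowing models to exploit semantic relations between textual and visual components; and (3) converts previously discarded graphics into reusable code-level supervision. To make this paradigm practical at scale, the authors build a comprehensive data engine from PDFs, rendered webpages, and native SVG assets, and train a compact 3B-parameter model through staged pretraining and supervised fine-tuning. dots.mocr is evaluated from two perspectives: document parsing and structured graphics parsing. On document parsing benchmarks, it ranks second only to Gemini 3 Pro on the OCR Arena Elo leaderboard, surpasses existing open-source document parsing systems, and sets a new state of the art of 83.9 on olmOCR Bench. On structured graphics parsing, dots.mocr achieves higher reconstruction quality than Gemini 3 Pro across image-to-SVG benchmarks, with strong performance on charts, UI layouts, scientific figures, and chemical diagrams. These results show a scalable path toward building large-scale image-to-code corpora for multimodal pretraining. Code and models are publicly available at https://github.com/rednote-hilab/dots.mocr.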

Related papers