Fugu-MT Paper Translation (Abstract): CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Paper overview: CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

arxiv url: http://arxiv.org/abs/2605.03903v1
Date: Tue, 05 May 2026 15:56:12 GMT
Status: Translation complete
System update date: 2026-05-06 19:35:44.017951
Title: CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Title (reference translation): CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Authors: Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Zubao Qin, Ze Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu
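Abstract summary: CC-OCR V2 is a comprehensive and challenging OCR benchmark tailored to real-world document processing. It focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks. Experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements.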
Reference score (independently computed attention score): 33.84177435117706
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.
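Abstract (reference translation): Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, because existing benchmarks adopt task scopes mismatched with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and covers five major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs show substantial performance degradation across diverse tasks and scenarios. These results indicate a significant gap between performance on current benchmarks and effectiveness in real-world applications. The full dataset and evaluation toolkit are available at https://github.com/eioss/CC-OCR-V2.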


Related papers