Fugu-MT Paper Translation (Abstract): OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

Paper summary: OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

arxiv url: http://arxiv.org/abs/2505.17163v1
Date: Thu, 22 May 2025 15:25:14 GMT
Status: Translation complete
Last updated in system: 2025-05-26 18:08:33.62505
Title: OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Title (reference translation): OCR Reasoning Benchmark: Revealing the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Authors: Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin
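Abstract summary: OCR-Reasoning is a comprehensive benchmark designed to evaluate multimodal large language models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. With annotated reasoning processes and final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes.
Reference score (independently computed attention score): 39.141660558608265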
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
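Abstract (reference translation): Recent advances in multimodal slow-thinking systems have shown remarkable performance on a variety of visual reasoning tasks. However, their capabilities on text-rich image reasoning tasks have not yet been examined because a systematic benchmark is lacking. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark for systematically evaluating multimodal large language models on text-rich image reasoning tasks. The benchmark consists of 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that annotate only the final answers, OCR-Reasoning annotates the reasoning process as well. Using the annotated reasoning processes and final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Using this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. The results show the limitations of existing methodologies. In particular, even state-of-the-art MLLMs struggle considerably, with none exceeding 50% accuracy on OCR-Reasoning, indicating that text-rich image reasoning is an urgent challenge. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.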

Related papers