Related papers: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

URL: http://arxiv.org/abs/2501.00321v1
Date: Tue, 31 Dec 2024 07:32:35 GMT
Title: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
Authors: Ling Fu, Biao Yang, Zhebin Kuang, Jiajun Song, Yuzhe Li, Linghao Zhu, Qidi Luo, Xinyu Wang, Hao Lu, Mingxin Huang, Zhang Li, Guozhi Tang, Bin Shan, Chunhui Lin, Qi Liu, Binghong Wu, Hao Feng, Hao Liu, Can Huang, Jingqun Tang, Wei Chen, Lianwen Jin, Yuliang Liu, Xiang Bai,
Abstract summary: We introduce OCRBench v2, a large-scale bilingual text-centric benchmark for text recognition.<n>We find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations.
Score: 72.57452266982642
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has witnessed growing interest recently. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities on certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples. After carefully benchmarking state-of-the-art LMMs on OCRBench v2, we find that 20 out of 22 LMMs score below 50 (100 in total) and suffer from five-type limitations, including less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The benchmark and evaluation scripts are available at https://github.com/Yuliang-liu/MultimodalOCR.

Related papers

LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs [52.79503055897109]
We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation. We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
arXiv Detail & Related papers (2025-04-11T08:46:49Z)
Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models [10.828419851213528]
We propose the Multi-Dimensional Insights benchmark, which includes over 500 images covering six common scenarios of human life.<n>This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups.<n>Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs.
arXiv Detail & Related papers (2024-12-17T07:06:10Z)
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction.<n>It includes 39 subsets with 7,058 full annotated images, of which 41% are sourced from real applications, and released for the first time.<n>We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios.<n>UrBench contains 11.6K meticulously curated questions at both region-level and role-level.<n>Our evaluations on 21 LMMs show that current LMMs struggle in the urban environments in several aspects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z)
MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images. Current benchmarks fail to accurately reflect performance of different models. We propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
Benchmarking Large Multimodal Models against Common Corruptions [45.26424202601339]
This technical report aims to fill a deficiency in the assessment of large multimodal models (LMMs) We investigate the cross-modal interactions between text, image, and speech, encompassing four essential generation tasks. We create a benchmark, named MMCBench, that covers more than 100 popular LMMs.
arXiv Detail & Related papers (2024-01-22T13:33:53Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.