Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
- URL: http://arxiv.org/abs/2505.12766v1
- Date: Mon, 19 May 2025 06:45:18 GMT
- Title: Reasoning-OCR: Can Large Multimodal Models Solve Complex Logical Reasoning Problems from OCR Cues?
- Authors: Haibin He, Maoyuan Ye, Jing Zhang, Xiantao Cai, Juhua Liu, Bo Du, Dacheng Tao
- Abstract summary: Reasoning-OCR challenges LMMs to solve complex reasoning problems based on the cues that can be extracted from rich visual-text. Our evaluation offers some insights for proprietary and open-source LMMs in different reasoning challenges.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have become increasingly versatile, accompanied by impressive Optical Character Recognition (OCR) related capabilities. Existing OCR-related benchmarks emphasize evaluating LMMs' abilities at relatively simple tasks such as visual question answering and visual-text parsing. However, the extent to which LMMs can handle complex logical reasoning problems based on OCR cues remains relatively unexplored. To this end, we introduce the Reasoning-OCR benchmark, which challenges LMMs to solve complex reasoning problems based on cues that can be extracted from rich visual-text. Reasoning-OCR covers six visual scenarios and encompasses 150 meticulously designed questions categorized into six reasoning challenges. Additionally, Reasoning-OCR minimizes the impact of field-specialized knowledge. Our evaluation offers some insights for proprietary and open-source LMMs across different reasoning challenges, underscoring the urgent need to improve reasoning performance. We hope Reasoning-OCR can inspire and facilitate future research on enhancing complex reasoning ability based on OCR cues. Reasoning-OCR is publicly available at https://github.com/Hxyz-123/ReasoningOCR.
Related papers
- Improving MLLM's Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency [31.095908827004695]
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks. However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. We introduce a novel fine-tuning paradigm, named Synchronously Self-Reviewing (SSR) its OCR proficiency, inspired by the concept of the "Bilingual Cognitive Advantage".
arXiv Detail & Related papers (2025-07-11T05:02:06Z) - OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning [39.141660558608265]
OCR-Reasoning is a comprehensive benchmark designed to assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes.
arXiv Detail & Related papers (2025-05-22T15:25:14Z) - LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images? [80.4577892387028]
We introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images. We develop a scalable, automated pipeline to convert a text corpus into multimodal samples. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings.
arXiv Detail & Related papers (2025-05-18T08:39:37Z) - Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity [1.8130068086063336]
Multimodal Large Language Models (LLMs) have attracted significant attention across various industrial fields. In this work, we examine a context-independent OCR task using single-character images with diverse visual complexities. Our findings reveal that multimodal LLMs can match conventional OCR methods at about 300 ppi, yet their performance deteriorates significantly below 150 ppi.
arXiv Detail & Related papers (2025-03-31T02:09:19Z) - MultiOCR-QA: Dataset for Evaluating Robustness of LLMs in Question Answering on Multilingual OCR Texts [17.20084584886653]
We introduce MultiOCR-QA, a multilingual QA dataset designed to analyze the effects of OCR noise on QA systems' performance. The MultiOCR-QA dataset comprises 60K question-answer pairs covering three languages: English, French, and German. Our findings show that QA systems are highly prone to OCR-induced errors and exhibit performance degradation on noisy OCR text.
arXiv Detail & Related papers (2025-02-24T02:16:37Z) - MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency [63.23935582919081]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). We introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs. We conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights.
arXiv Detail & Related papers (2025-02-13T18:59:46Z) - ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning [92.76959707441954]
We introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance. ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity. Our results reveal a significant decline in accuracy as problem complexity grows.
arXiv Detail & Related papers (2025-02-03T06:44:49Z) - Ocean-OCR: Towards General OCR Application via a Vision-Language Model [6.70908296002235]
We present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios.
arXiv Detail & Related papers (2025-01-26T15:20:39Z) - CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [50.78228433498211]
CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate nine prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition.
arXiv Detail & Related papers (2024-12-03T07:03:25Z) - ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs [95.15814662348245]
Compositional Reasoning (CR) entails grasping the significance of attributes, relations, and word order.
Recent Vision-Language Models (VLMs) have demonstrated remarkable proficiency in such reasoning tasks.
arXiv Detail & Related papers (2024-06-12T12:54:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.