MMR: Evaluating Reading Ability of Large Multimodal Models
- URL: http://arxiv.org/abs/2408.14594v1
- Date: Mon, 26 Aug 2024 19:26:50 GMT
- Title: MMR: Evaluating Reading Ability of Large Multimodal Models
- Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen, et al.
- Abstract summary: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
- Score: 52.953316772123586
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks are simple extraction-based question answering, and many LMMs now easily achieve high scores. This means that current benchmarks fail to accurately reflect the performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs for text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. By evaluating several state-of-the-art LMMs, including GPT-4o, it reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.
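As a rough, unofficial sketch of what evaluating an LMM such as GPT-4o on a single text-rich image question can look like in practice (the prompt wording and the substring-match scoring below are placeholders, not the MMR paper's protocol):

```python
# Hypothetical single-question evaluation of GPT-4o on a text-rich image.
# The scoring here is a crude substring match; MMR defines its own per-task metrics.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask(image_path: str, question: str) -> str:
    """Send one image plus one question to GPT-4o and return its text answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content


def is_correct(prediction: str, reference: str) -> bool:
    # Placeholder check: counts the answer as correct if it contains the reference string.
    return reference.strip().lower() in prediction.strip().lower()
```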
Related papers
- Beyond Single Frames: Can LMMs Comprehend Temporal and Contextual Narratives in Image Sequences? [32.61269125015993]
StripCipher is a benchmark designed to evaluate the capabilities of Large Multimodal Models (LMMs) to comprehend and reason over sequential images.
StripCipher comprises a human-annotated dataset and three challenging subtasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering.
Our evaluation of 16 state-of-the-art LMMs, including GPT-4o and Qwen2.5VL, reveals a significant performance gap compared to human capabilities.
arXiv Detail & Related papers (2025-02-19T18:04:44Z)
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.57452266982642]
We introduce OCRBench v2, a large-scale bilingual text-centric benchmark for text recognition.
We find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations.
arXiv Detail & Related papers (2024-12-31T07:32:35Z)
- SILMM: Self-Improving Large Multimodal Models for Compositional Text-to-Image Generation [92.73405185996315]
Large Multimodal Models (LMMs) have demonstrated impressive capabilities in multimodal understanding and generation.
Existing approaches, such as layout planning for multi-step generation and learning from human feedback or AI feedback, depend heavily on prompt engineering.
We introduce SILMM, a model-agnostic iterative self-feedback framework that enables LMMs to provide helpful and scalable self-improvement and to optimize text-image alignment.
arXiv Detail & Related papers (2024-12-08T05:28:08Z)
- MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective [32.55432949789787]
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
We propose a straightforward automated evaluation pipeline that requires LMMs to generate an image-prompt from a given input image.
We then employ text-to-image generative models to create a new image based on these generated prompts.
Finally, we evaluate the performance of LMMs by comparing the original image with the generated one.
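A back-of-the-envelope sketch of this generate-and-compare loop is given below; the specific models (sdxl-turbo for text-to-image, CLIP for image similarity) and the cosine-similarity score are assumptions for illustration, not MMGenBench's actual setup:

```python
# Sketch of an MMGenBench-style loop: an LMM writes an image-prompt for the input
# image, a text-to-image model re-renders it, and the two images are compared.
import torch
from PIL import Image
from diffusers import AutoPipelineForText2Image
from transformers import CLIPModel, CLIPProcessor

t2i = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def image_similarity(a: Image.Image, b: Image.Image) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    inputs = clip_proc(images=[a, b], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])


def evaluate(original: Image.Image, lmm_describe) -> float:
    # `lmm_describe` stands in for the LMM under test: it takes an image and
    # returns the image-prompt the LMM writes for it.
    prompt = lmm_describe(original)
    regenerated = t2i(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
    return image_similarity(original, regenerated)
```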
arXiv Detail & Related papers (2024-11-21T12:16:16Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
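For context only, a minimal sketch of the retrieval step that a visual-RAG setup like this implies (embedding the candidate images and the question with CLIP and keeping the top-k matches before querying the LMM); MIRAGE's actual design is not reproduced here:

```python
# Hypothetical retrieval step for multi-image question answering: keep only the
# candidate images whose CLIP embeddings best match the question text.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def retrieve(question: str, images: list[Image.Image], k: int = 4) -> list[Image.Image]:
    """Return the k images most similar to the question, by CLIP cosine similarity."""
    with torch.no_grad():
        img_feats = clip.get_image_features(**proc(images=images, return_tensors="pt"))
        txt_feats = clip.get_text_features(
            **proc(text=[question], return_tensors="pt", padding=True)
        )
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    scores = (img_feats @ txt_feats.T).squeeze(-1)  # one score per candidate image
    top = torch.topk(scores, k=min(k, len(images))).indices
    return [images[i] for i in top.tolist()]
```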
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [153.37868034779385]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.