MMR: Evaluating Reading Ability of Large Multimodal Models
- URL: http://arxiv.org/abs/2408.14594v1
- Date: Mon, 26 Aug 2024 19:26:50 GMT
- Title: MMR: Evaluating Reading Ability of Large Multimodal Models
- Authors: Jian Chen, Ruiyi Zhang, Yufan Zhou, Ryan Rossi, Jiuxiang Gu, Changyou Chen,
- Abstract summary: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images.
Current benchmarks fail to accurately reflect performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding.
- Score: 52.953316772123586
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of image, including text-rich images. Most existing text-rich image benchmarks are simple extraction-based question answering, and many LMMs now easily achieve high scores. This means that current benchmarks fail to accurately reflect performance of different models, and a natural idea is to build a new benchmark to evaluate their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark in 11 diverse tasks to evaluate LMMs for text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. By evaluating several state-of-the-art LMMs, including GPT-4o, it reveals the limited capabilities of existing LMMs underscoring the value of our benchmark.
Related papers
- MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective [32.55432949789787]
Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
We propose a straightforward automated evaluation pipeline that requires LMMs to generate an image-prompt from a given input image.
We then employ text-to-image generative models to create a new image based on these generated prompts.
Finally, we evaluate the performance of LMMs by comparing the original image with the generated one.
arXiv Detail & Related papers (2024-11-21T12:16:16Z) - MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding [66.23502779435053]
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks.
Existing benchmarks either contain limited fine-grained evaluation samples mixed with other data, or are confined to object-level assessments in natural images.
We propose using document images with multi-granularity and multi-modal information to supplement natural images.
arXiv Detail & Related papers (2024-10-25T16:00:55Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z) - Hijacking Context in Large Multi-modal Models [3.6411220072843866]
We introduce a new limitation of off-the-shelf LMMs where a small fraction of incoherent images mislead LMMs to only generate biased output about the hijacked context.
We propose a pre-filtering method that removes irrelevant contexts via GPT-4V, based on its robustness towards distribution shift within the contexts.
We investigate whether replacing the hijacked visual and textual contexts with the correlated ones via GPT-4V and text-to-image models can help yield coherent responses.
arXiv Detail & Related papers (2023-12-07T11:23:29Z) - MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (textbfMMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.