MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
- URL: http://arxiv.org/abs/2411.14062v1
- Date: Thu, 21 Nov 2024 12:16:16 GMT
- Title: MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective
- Authors: Hailang Huang, Yong Wang, Zixuan Huang, Huaqiu Li, Tongwen Huang, Xiangxiang Chu, Richong Zhang
- Abstract summary: Large Multimodal Models (LMMs) have demonstrated remarkable capabilities.
We propose a straightforward automated evaluation pipeline that requires LMMs to generate an image-prompt from a given input image.
We then employ text-to-image generative models to create a new image based on these generated prompts.
Finally, we evaluate the performance of LMMs by comparing the original image with the generated one.
- Score: 32.55432949789787
- Abstract: Large Multimodal Models (LMMs) have demonstrated remarkable capabilities. While existing benchmarks for evaluating LMMs mainly focus on image comprehension, few works evaluate them from the image generation perspective. To address this gap, we propose a straightforward automated evaluation pipeline. Specifically, the pipeline requires LMMs to generate an image-prompt from a given input image. Subsequently, it employs text-to-image generative models to create a new image based on these generated prompts. Finally, we evaluate the performance of LMMs by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, targeting the performance evaluation of LMMs within the generative image domain. A thorough evaluation involving over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and the benchmarks. Our observations indicate that numerous LMMs excelling in existing benchmarks fail to adequately complete basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline enables efficient assessment of LMM performance across diverse domains using only image inputs.
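The three-step pipeline is easy to prototype. Below is a minimal sketch, assuming a hypothetical `describe_image` wrapper around whichever LMM is under test; Stable Diffusion (via diffusers) and CLIP image-embedding cosine similarity (via transformers) are illustrative stand-ins, since the abstract does not name the text-to-image model or the comparison metric used by the authors.

```python
# Minimal sketch of the three-step pipeline, not the authors' implementation.
# `describe_image` is a hypothetical wrapper around the LMM under test;
# the model ids below are placeholders; substitute available checkpoints.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def describe_image(image: Image.Image) -> str:
    """Hypothetical: ask the LMM under test to write an image-prompt."""
    raise NotImplementedError("plug in the LMM being evaluated")

def image_similarity(a: Image.Image, b: Image.Image) -> float:
    """Cosine similarity between CLIP embeddings of two images."""
    inputs = clip_proc(images=[a, b], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

def evaluate_lmm(original: Image.Image) -> float:
    prompt = describe_image(original)      # step 1: LMM writes an image-prompt
    generated = t2i(prompt).images[0]      # step 2: regenerate from the prompt
    return image_similarity(original, generated)  # step 3: compare the images
```

Scoring a model then reduces to averaging `evaluate_lmm` over a benchmark's images: an LMM that describes images faithfully should yield regenerations close to the originals.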
Related papers
- LoRA-Contextualizing Adaptation of Large Multimodal Models for Long Document Understanding [103.69014172427026]
Large multimodal models (LMMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page, visually-rich documents.
We present LoRA-Contextualizing Adaptation of Large multimodal models (LoCAL), a novel framework that broadens the capabilities of any LMM to support long-document understanding.
arXiv Detail & Related papers (2024-11-02T02:09:01Z)
- R-Bench: Are your Large Multimodal Model Robust to Real-world Corruptions? [86.94616033250068]
R-Bench is a benchmark focused on the **Real-world Robustness of LMMs**.
We show that while LMMs can correctly handle the original reference images, their performance is not stable when faced with distorted images.
We hope that R-Bench will inspire improvements in the robustness of LMMs, **extending them from experimental simulations to real-world applications**.
arXiv Detail & Related papers (2024-10-07T20:12:08Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
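The retrieval step that makes such scale tractable can be sketched in a few lines. This is a rough illustration of the visual-RAG idea, not MIRAGE's actual code: CLIP as the retriever and a top-5 cutoff are assumptions for the sketch.

```python
# Rough visual-RAG retrieval sketch (not the MIRAGE implementation):
# embed every candidate image once, score against the question embedding,
# and pass only the best matches on to an LMM.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(question: str, images: list, k: int = 5) -> list:
    img_in = proc(images=images, return_tensors="pt")
    txt_in = proc(text=[question], return_tensors="pt", padding=True)
    with torch.no_grad():
        img_f = model.get_image_features(**img_in)
        txt_f = model.get_text_features(**txt_in)
    # Cosine similarity via normalized dot products.
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    scores = (img_f @ txt_f.T).squeeze(-1)
    top = scores.topk(min(k, len(images))).indices
    return [images[i] for i in top]
```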
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM, an approach for grounding frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
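A loose sketch of the attention-to-mask idea follows; the layer widths and tensor shapes are assumptions for illustration, not the authors' architecture. The per-word attention maps are stacked as channels and a small CNN produces one mask logit per pixel.

```python
# Loose illustration (not the F-LMM authors' code): a few trainable conv
# layers map word-to-pixel attention weights onto per-pixel mask logits.
import torch
import torch.nn as nn

class AttnToMask(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_heads, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=1),  # one logit per pixel
        )

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: [batch, num_heads, H, W] word-to-pixel attention weights
        return self.net(attn).squeeze(1)  # [batch, H, W] mask logits

logits = AttnToMask(num_heads=16)(torch.rand(1, 16, 24, 24))
```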
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)