LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
- URL: http://arxiv.org/abs/2504.08358v1
- Date: Fri, 11 Apr 2025 08:46:49 GMT
- Title: LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
- Authors: Jiarui Wang, Huiyu Duan, Yu Zhao, Juntong Wang, Guangtao Zhai, Xiongkuo Min
- Abstract summary: We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation. We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
- Score: 52.79503055897109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues related to perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean-opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated from 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K, and exhibits strong generalization ability on other AI-generated image evaluation benchmark datasets, manifesting the generality of both the EvalMi-50K dataset and LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
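Since every image in EvalMi-50K carries human mean-opinion scores, an automatic metric such as LMM4LMM is typically validated by how strongly its predicted scores correlate with those MOSs (e.g., SRCC and PLCC). The snippet below is a minimal sketch of that comparison, not the project's official evaluation code; the annotation file name and record keys are illustrative assumptions rather than the released data format.

```python
# Minimal sketch: agreement between an automatic metric and human MOSs.
# The file "evalmi50k_annotations.json" and the keys "metric_score" and
# "mos" are hypothetical placeholders, not the official EvalMi-50K schema.
import json
from scipy.stats import spearmanr, pearsonr

def correlation_with_mos(pred_scores, mos_scores):
    """Return (SRCC, PLCC), the standard agreement measures for
    perceptual-quality and alignment metrics."""
    srcc, _ = spearmanr(pred_scores, mos_scores)
    plcc, _ = pearsonr(pred_scores, mos_scores)
    return srcc, plcc

if __name__ == "__main__":
    # Assumed format: a list of records, each with a human MOS and the
    # metric's predicted score for one generated image.
    with open("evalmi50k_annotations.json") as f:  # hypothetical file
        records = json.load(f)
    preds = [r["metric_score"] for r in records]   # hypothetical key
    moss = [r["mos"] for r in records]             # hypothetical key
    srcc, plcc = correlation_with_mos(preds, moss)
    print(f"SRCC={srcc:.4f}  PLCC={plcc:.4f}")
```

Higher SRCC/PLCC values indicate closer agreement with human preferences, which is the sense in which the abstract reports state-of-the-art performance on EvalMi-50K.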
Related papers
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.57452266982642]
We introduce OCRBench v2, a large-scale bilingual text-centric benchmark for text recognition. We find that 20 out of 22 LMMs score below 50 (out of 100) and suffer from five types of limitations.
arXiv Detail & Related papers (2024-12-31T07:32:35Z)
- Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models [10.828419851213528]
We propose the Multi-Dimensional Insights benchmark, which includes over 500 images covering six common scenarios of human life. This design allows for a detailed assessment of LMMs' capabilities in meeting the preferences and needs of different age groups. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs.
arXiv Detail & Related papers (2024-12-17T07:06:10Z)
- MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, consisting of 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z)
- MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
We introduce MM-Vet v2, which includes a new capability called "image-text sequence understanding". Using MM-Vet v2 to benchmark large multimodal models, we find that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly outperforming GPT-4o, which scored 71.0.
arXiv Detail & Related papers (2024-08-01T17:59:54Z)
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [81.95879920888716]
We introduce ShareGPT4V, a dataset featuring 1.2 million descriptive captions.
This dataset surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations.
We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, a superior LMM.
arXiv Detail & Related papers (2023-11-21T18:58:11Z)
- KOSMOS-2.5: A Multimodal Literate Model [136.96172068766285]
We present KOSMOS-2.5, a multimodal literate model for machine reading of text-intensive images.
KOSMOS-2.5 excels in two distinct yet complementary transcription tasks.
We fine-tune KOSMOS-2.5 for document understanding tasks, resulting in a document understanding generalist named KOSMOS-2.5-CHAT.
arXiv Detail & Related papers (2023-09-20T15:50:08Z)
- WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models [2.603259641572195]
We introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M image-text pairs in total.
About 600 million pairs are collected from webpages in which the image and its caption are only weakly correlated.
We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training.
arXiv Detail & Related papers (2022-03-22T06:12:20Z)