LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
- URL: http://arxiv.org/abs/2507.16193v1
- Date: Tue, 22 Jul 2025 03:11:07 GMT
- Title: LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
- Authors: Zitong Xu, Huiyu Duan, Bingnan Liu, Guangji Ma, Jiarui Wang, Liu Yang, Shiqi Gao, Xiaoyu Wang, Jia Wang, Xiongkuo Min, Guangtao Zhai, Weisi Lin
- Abstract summary: We introduce EBench-18K, the first large-scale image Editing Benchmark, comprising 18K edited images with fine-grained human preference annotations. EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. We then propose LMM4Edit, an LMM-based metric for evaluating image editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy.
- Score: 76.57152007140475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of Text-guided Image Editing (TIE) enables image modifications through text prompts. However, current TIE models still struggle to balance image quality, editing alignment, and consistency with the original image, limiting their practical applications. Existing TIE evaluation benchmarks and metrics have limitations in scale or in alignment with human perception. To this end, we introduce EBench-18K, the first large-scale image Editing Benchmark, comprising 18K edited images with fine-grained human preference annotations for evaluating TIE. Specifically, EBench-18K includes 1,080 source images with corresponding editing prompts across 21 tasks, 18K+ edited images produced by 17 state-of-the-art TIE models, 55K+ mean opinion scores (MOSs) assessed from three evaluation dimensions, and 18K+ question-answering (QA) pairs. Based on EBench-18K, we employ outstanding LMMs to assess edited images, while the evaluation results, in turn, provide insights into the alignment between the LMMs' understanding ability and human preferences. We then propose LMM4Edit, an LMM-based metric for evaluating image editing models from perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy in an all-in-one manner. Extensive experiments show that LMM4Edit achieves outstanding performance and aligns well with human preference. Zero-shot validation on other datasets also shows the generalization ability of our model. The dataset and code are available at https://github.com/IntMeGroup/LMM4Edit.
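The abstract describes an all-in-one evaluation over perceptual quality, editing alignment, attribute preservation, and task-specific QA accuracy. The sketch below illustrates how such an LMM-based scoring loop could be organized; it is a minimal illustration only, and `query_lmm`, the prompt wording, and the 1-5 rating scale are hypothetical placeholders rather than the authors' released implementation (see the linked GitHub repository for that).

```python
# Minimal sketch of an all-in-one LMM-based editing evaluation loop in the
# spirit of LMM4Edit. `query_lmm` is a hypothetical placeholder for whatever
# multimodal model backend is actually used.

from dataclasses import dataclass
from statistics import mean

# The three MOS dimensions named in the abstract.
DIMENSIONS = ("perceptual quality", "editing alignment", "attribute preservation")


@dataclass
class EditSample:
    source_image: str                     # path to the original image
    edited_image: str                     # path to the edited image
    prompt: str                           # the editing instruction
    qa_pairs: list[tuple[str, str]]       # task-specific (question, answer) pairs


def query_lmm(images: list[str], question: str) -> str:
    """Placeholder: send images plus a question to an LMM and return its answer."""
    raise NotImplementedError("plug in the actual multimodal model backend here")


def score_sample(sample: EditSample) -> dict[str, float]:
    scores: dict[str, float] = {}
    # Rate each dimension on a 1-5 scale, mirroring MOS-style annotation.
    for dim in DIMENSIONS:
        answer = query_lmm(
            [sample.source_image, sample.edited_image],
            f"Given the instruction '{sample.prompt}', rate the {dim} of the "
            "edited image from 1 (worst) to 5 (best). Reply with a number only.",
        )
        scores[dim] = float(answer.strip())
    # Task-specific QA accuracy: fraction of questions answered correctly.
    correct = [
        query_lmm([sample.edited_image], q).strip().lower() == a.strip().lower()
        for q, a in sample.qa_pairs
    ]
    scores["qa accuracy"] = mean(correct) if correct else 0.0
    return scores
```

Averaging these per-sample scores over all edited images produced by a given TIE model would then yield a model-level ranking comparable to the human MOS annotations; how the actual metric aggregates and fine-tunes the LMM is specified in the paper and repository, not here.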
Related papers
- ImgEdit: A Unified Image Editing Dataset and Benchmark [14.185771939071149]
We introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs. ImgEdit surpasses existing datasets in both task novelty and data quality. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance.
arXiv Detail & Related papers (2025-05-26T17:53:33Z)
- GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z)
- Towards Scalable Human-aligned Benchmark for Text-guided Image Editing [9.899869794429579]
We introduce a novel Human-Aligned benchmark for Text-guided Image Editing (HATIE). HATIE provides a fully-automated and omnidirectional evaluation pipeline. We empirically verify that the evaluation of HATIE is indeed human-aligned in various aspects.
arXiv Detail & Related papers (2025-05-01T13:06:05Z)
- LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs [52.79503055897109]
We present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large-multimodal image generation. We propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation from multiple dimensions.
arXiv Detail & Related papers (2025-04-11T08:46:49Z)
- MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding [66.23502779435053]
Large Vision-Language Models (LVLMs) have achieved remarkable performance in many vision-language tasks.
Existing benchmarks either contain limited fine-grained evaluation samples mixed with other data, or are confined to object-level assessments in natural images.
We propose using document images with multi-granularity and multi-modal information to supplement natural images.
arXiv Detail & Related papers (2024-10-25T16:00:55Z)
- I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions.
We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)
- HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing [38.13162627140172]
HQ-Edit is a high-quality instruction-based image editing dataset with around 200,000 edits.
To ensure its high quality, diverse examples are first collected online, expanded, and then used to create high-quality diptychs.
HQ-Edit's high-resolution images, rich in detail and accompanied by comprehensive editing prompts, substantially enhance the capabilities of existing image editing models.
arXiv Detail & Related papers (2024-04-15T17:59:31Z)
- EditVal: Benchmarking Diffusion Based Text-Guided Image Editing Methods [52.43439659492655]
We introduce EditVal, a standardized benchmark for quantitatively evaluating text-guided image editing methods.
EditVal consists of a curated dataset of images, a set of editable attributes for each image drawn from 13 possible edit types, and an automated evaluation pipeline.
We use EditVal to benchmark 8 cutting-edge diffusion-based editing methods including SINE, Imagic and Instruct-Pix2Pix.
arXiv Detail & Related papers (2023-10-03T20:46:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.