I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
- URL: http://arxiv.org/abs/2512.04660v1
- Date: Thu, 04 Dec 2025 10:44:07 GMT
- Title: I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models
- Authors: Juntong Wang, Jiarui Wang, Huiyu Duan, Jiaxiang Kang, Guangtao Zhai, Xiongkuo Min,
- Abstract summary: Existing image editing benchmarks suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations. We propose \textbf{I2I-Bench}, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions.
- Score: 78.62380562116135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image editing models are advancing rapidly, yet comprehensive evaluation remains a significant challenge. Existing image editing benchmarks generally suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations, which significantly constrain their scalability and practical applicability. To address this, we propose \textbf{I2I-Bench}, a comprehensive benchmark for image-to-image editing models, which features (i) diverse tasks, encompassing 10 task categories across both single-image and multi-image editing tasks, (ii) comprehensive evaluation dimensions, including 30 decoupled and fine-grained evaluation dimensions with automated hybrid evaluation methods that integrate specialized tools and large multimodal models (LMMs), and (iii) rigorous alignment validation, justifying the consistency between our benchmark evaluations and human preferences. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions. We will open-source all components of I2I-Bench to facilitate future research.
Related papers
- DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model [10.609050605838805]
This paper introduces DeepLookEditBench, the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. We construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
arXiv Detail & Related papers (2026-02-27T02:59:34Z) - How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between the original and the edited image. It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z) - ImgEdit: A Unified Image Editing Dataset and Benchmark [14.185771939071149]
We introduce ImgEdit, a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs. ImgEdit surpasses existing datasets in both task novelty and data quality. For comprehensive evaluation, we introduce ImgEdit-Bench, a benchmark designed to evaluate image editing performance.
arXiv Detail & Related papers (2025-05-26T17:53:33Z) - CompBench: Benchmarking Complex Instruction-guided Image Editing [63.347846732450364]
CompBench is a large-scale benchmark for complex instruction-guided image editing. We propose an MLLM-human collaborative framework with tailored task pipelines. We propose an instruction decoupling strategy that disentangles editing intents into four key dimensions.
arXiv Detail & Related papers (2025-05-18T02:30:52Z) - PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z) - I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions.
We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.