DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
- URL: http://arxiv.org/abs/2602.23622v1
- Date: Fri, 27 Feb 2026 02:59:34 GMT
- Title: DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model
- Authors: Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao,
- Abstract summary: This paper introduces DeepLookEditBench, the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. We construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
- Score: 10.609050605838805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
Related papers
- RealDrag: The First Dragging Benchmark with Real Target Image [9.439854281295803]
RealDrag is the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS).
arXiv Detail & Related papers (2025-12-13T11:14:03Z) - I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models [78.62380562116135]
Existing image editing benchmarks suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations. We propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions.
arXiv Detail & Related papers (2025-12-04T10:44:07Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
arXiv Detail & Related papers (2025-09-16T17:45:39Z) - What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between the original and the edited image. It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. We demonstrate that DICE effectively identifies coherent edits, evaluating images generated by different editing models with a strong correlation with human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z) - GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z) - PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z) - I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions.
We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)