RealDrag: The First Dragging Benchmark with Real Target Image
- URL: http://arxiv.org/abs/2512.12287v1
- Date: Sat, 13 Dec 2025 11:14:03 GMT
- Title: RealDrag: The First Dragging Benchmark with Real Target Image
- Authors: Ahmad Zafarani, Zahra Dehghanian, Mohammadreza Davoodi, Mohsen Shadroo, MohammadAmin Fazli, Hamid R. Rabiee
- Abstract summary: RealDrag is the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS).
- Score: 9.439854281295803
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of drag-based image editing models is unreliable due to a lack of standardized benchmarks and metrics. This ambiguity stems from inconsistent evaluation protocols and, critically, the absence of datasets containing ground-truth target images, making objective comparisons between competing methods difficult. To address this, we introduce RealDrag, the first comprehensive benchmark for point-based image editing that includes paired ground-truth target images. Our dataset contains over 400 human-annotated samples from diverse video sources, providing source/target images, handle/target points, editable-region masks, and descriptive captions for both the image and the editing action. We also propose four novel, task-specific metrics: Semantical Distance (SeD), Outer Mask Preserving Score (OMPS), Inner Patch Preserving Score (IPPS), and Directional Similarity (DiS). These metrics are designed to quantify pixel-level matching fidelity, verify preservation of non-edited (out-of-mask) regions, and measure semantic alignment with the desired task. Using this benchmark, we conduct the first large-scale systematic analysis of the field, evaluating 17 SOTA models. Our results reveal clear trade-offs among current approaches and establish a robust, reproducible baseline to guide future research. Our dataset and evaluation toolkit will be made publicly available.
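The abstract names the four metrics and the per-sample annotations but does not spell out their formulas. The Python sketch below is only an illustration of how mask-based scores in their spirit could be computed from one benchmark sample: the `DragSample` fields, the inverse-MSE form of the OMPS/IPPS-style scores, and the cosine-similarity form of the DiS-style score are assumptions for illustration, not the authors' definitions. SeD would additionally require a semantic embedding model (e.g. a CLIP-like encoder) and is omitted here.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class DragSample:
    """One RealDrag-style sample as described in the abstract.
    Field names are illustrative, not the released schema."""
    source: np.ndarray          # (H, W, 3) source image, float in [0, 1]
    target: np.ndarray          # (H, W, 3) ground-truth target image
    handle_points: np.ndarray   # (N, 2) drag start points (x, y)
    target_points: np.ndarray   # (N, 2) drag end points (x, y)
    edit_mask: np.ndarray       # (H, W), 1 inside the editable region
    image_caption: str
    edit_caption: str


def omps_like(sample: DragSample, edited: np.ndarray) -> float:
    """OMPS-style score (assumed form): preservation of pixels OUTSIDE the
    editable mask, measured against the source image. Higher is better."""
    outside = sample.edit_mask < 0.5
    mse = ((sample.source - edited) ** 2)[outside].mean()
    return 1.0 / (1.0 + mse)


def ipps_like(sample: DragSample, edited: np.ndarray) -> float:
    """IPPS-style score (assumed form): pixel-level agreement with the
    ground-truth target INSIDE the editable mask. Higher is better."""
    inside = sample.edit_mask >= 0.5
    mse = ((sample.target - edited) ** 2)[inside].mean()
    return 1.0 / (1.0 + mse)


def dis_like(sample: DragSample, realised_endpoints: np.ndarray) -> float:
    """DiS-style score (assumed form): mean cosine similarity between the
    annotated drag vectors (handle -> target) and the displacements actually
    realised by the edit (e.g. endpoints recovered by a point tracker)."""
    intended = sample.target_points - sample.handle_points
    realised = realised_endpoints - sample.handle_points
    num = (intended * realised).sum(axis=1)
    den = (np.linalg.norm(intended, axis=1)
           * np.linalg.norm(realised, axis=1) + 1e-8)
    return float(np.mean(num / den))
```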
Related papers
- DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model [10.609050605838805]
This paper introduces DeepLookEditBench, the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. We construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
arXiv Detail & Related papers (2026-02-27T02:59:34Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions [55.95282725491425]
PoSh is a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge. PoSh is replicable, interpretable and a better proxy for human raters than existing metrics. We show that PoSh achieves stronger correlations with the human judgments in DOCENT than the best open-weight alternatives.
arXiv Detail & Related papers (2025-10-21T20:30:20Z) - GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z) - Pluralistic Salient Object Detection [108.74650817891984]
We introduce pluralistic salient object detection (PSOD), a novel task aimed at generating multiple plausible salient segmentation results for a given input image.
We present two new SOD datasets "DUTS-MM" and "DUS-MQ", along with newly designed evaluation metrics.
arXiv Detail & Related papers (2024-09-04T01:38:37Z) - Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing [4.948910649137149]
Diffusion Transformers (DiTs) have achieved remarkable success in diverse and high-quality text-to-image (T2I) generation.
We investigate how text and image latents individually and jointly contribute to the semantics of generated images.
We propose a simple and effective Extract-Manipulate-Sample framework for zero-shot fine-grained image editing.
arXiv Detail & Related papers (2024-08-23T19:00:52Z) - CrossScore: Towards Multi-View Image Evaluation and Scoring [24.853612457257697]
Our cross-reference image quality assessment method fills a gap in the image assessment landscape.
Our method enables accurate image quality assessment without requiring ground truth references.
arXiv Detail & Related papers (2024-04-22T17:59:36Z) - iEdit: Localised Text-guided Image Editing with Weak Supervision [53.082196061014734]
We propose a novel learning method for text-guided image editing.
It generates images conditioned on a source image and a textual edit prompt.
It shows favourable results against its counterparts in terms of image fidelity and CLIP alignment score, and qualitatively for editing both generated and real images.
arXiv Detail & Related papers (2023-05-10T07:39:14Z) - MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation [104.40114562948428]
In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation.
We propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain.
MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA.
arXiv Detail & Related papers (2022-12-02T17:29:32Z) - Complex Scene Image Editing by Scene Graph Comprehension [17.72638225034884]
We propose SGC-Net, a two-stage method for complex scene image editing by scene graph comprehension.
In the first stage, we train a Region of Interest (RoI) prediction network that uses scene graphs to predict the locations of the target objects.
The second stage uses a conditional diffusion model to edit the image based on our RoI predictions (see the sketch below).
arXiv Detail & Related papers (2022-03-24T05:12:54Z)