EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
- URL: http://arxiv.org/abs/2509.13399v2
- Date: Thu, 16 Oct 2025 01:09:09 GMT
- Title: EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing
- Authors: Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
- Abstract summary: EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
- Score: 170.71134330650796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images, resulting in limited coverage and inheriting biases from prior generative models, or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated and fine-grained evaluation framework grounded in an object-centric perspective, designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. Given an input image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions while dynamically updating object pools across turns. These two stages enable two novel object-centric metrics tailored for multi-turn evaluation and one global metric of visual quality: (1) EdiVal-IF, which measures instruction following by combining open-vocabulary object detectors for symbolic checks with VLMs for semantic verification on detector-guided crops; (2) EdiVal-CC, which evaluates content consistency by calculating semantic similarity of unchanged objects and background using the evolving object pools; and (3) EdiVal-VQ, which quantifies changes in overall visual quality with human preference models. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models.
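To make the three metrics concrete, here is a minimal Python sketch of one evaluation turn under the object-centric scheme the abstract describes. Every name in it (`evaluate_turn`, `detect_objects`, `vlm_verify`, `crop_similarity`, `preference_score`, `object_pool`) is a hypothetical stand-in for the open-vocabulary detector, VLM, embedding model, and human-preference model the paper combines; it is a sketch of the idea, not the authors' actual implementation or API.

```python
# Hedged sketch of one evaluation turn in the object-centric scheme described above.
# All callables passed in are hypothetical stand-ins, not the authors' actual API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TurnScores:
    instruction_following: float   # analogue of EdiVal-IF
    content_consistency: float     # analogue of EdiVal-CC
    visual_quality_delta: float    # analogue of EdiVal-VQ

def evaluate_turn(
    prev_image,
    edited_image,
    instruction: Dict,            # e.g. {"type": "add", "target": "red mug"}
    object_pool: List[str],       # objects expected to remain unchanged this turn
    detect_objects: Callable,     # open-vocab detector: (image, [names]) -> [boxes]
    vlm_verify: Callable,         # VLM check on a detector-guided crop -> bool
    crop_similarity: Callable,    # semantic similarity of two crops -> float in [0, 1]
    preference_score: Callable,   # human-preference model: image -> float
) -> TurnScores:
    # (1) Instruction following: symbolic check with the detector, then semantic
    #     verification with the VLM on the detected crop.
    boxes = detect_objects(edited_image, [instruction["target"]])
    followed = bool(boxes) and vlm_verify(edited_image, boxes[0], instruction)

    # (2) Content consistency: objects that were not edited (plus background)
    #     should stay semantically similar between consecutive turns.
    sims = []
    for name in object_pool:
        prev_boxes = detect_objects(prev_image, [name])
        curr_boxes = detect_objects(edited_image, [name])
        if prev_boxes and curr_boxes:
            sims.append(crop_similarity((prev_image, prev_boxes[0]),
                                        (edited_image, curr_boxes[0])))
    consistency = sum(sims) / len(sims) if sims else 0.0

    # (3) Visual quality: change in a human-preference score across the edit.
    quality_delta = preference_score(edited_image) - preference_score(prev_image)

    return TurnScores(float(followed), consistency, quality_delta)
```

A multi-turn run would call this once per instruction and update `object_pool` after each turn (dropping objects that were edited or removed, adding newly introduced ones), which mirrors the evolving object pools described in the abstract.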
Related papers
- DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model [10.609050605838805]
This paper introduces DeepLookEditBench, the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. We construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.
arXiv Detail & Related papers (2026-02-27T02:59:34Z) - UniREditBench: A Unified Reasoning-based Image Editing Benchmark [52.54256348710893]
This work proposes UniREditBench, a unified benchmark for reasoning-based image editing evaluation. It comprises 2,700 meticulously curated samples, covering both real- and game-world scenarios across 8 primary dimensions and 18 sub-dimensions. We fine-tune Bagel on this dataset and develop UniREdit-Bagel, demonstrating substantial improvements in both in-domain and out-of-distribution settings.
arXiv Detail & Related papers (2025-11-03T07:24:57Z) - Semantic Visual Anomaly Detection and Reasoning in AI-Generated Images [96.43608872116347]
AnomReason is a large-scale benchmark with structured quadruple annotations, introduced alongside AnomAgent. AnomReason and AnomAgent serve as a foundation for measuring and improving the semantic plausibility of AI-generated images.
arXiv Detail & Related papers (2025-10-11T14:09:24Z) - From Editor to Dense Geometry Estimator [77.21804448599009]
We introduce FE2E, a framework that adapts an advanced editing model based on the Diffusion Transformer (DiT) architecture for dense geometry prediction. FE2E achieves over 35% performance gains on the ETH3D dataset and outperforms the DepthAnything series, which is trained on 100x the data.
arXiv Detail & Related papers (2025-09-04T15:58:50Z) - DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception [0.846600473226587]
We introduce remote sensing image change analysis (RSICA) as a new paradigm that combines the strengths of change detection and visual question answering. We propose DeltaVLM, an end-to-end architecture tailored for interactive RSICA. DeltaVLM features three innovations: (1) a fine-tuned bi-temporal vision encoder to capture temporal differences; (2) a visual difference perception module with a cross-semantic relation measuring mechanism to interpret changes; and (3) an instruction-guided Q-former to effectively extract query-relevant difference information.
arXiv Detail & Related papers (2025-07-30T03:14:27Z) - IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering [7.247417417159471]
Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition.
arXiv Detail & Related papers (2025-06-29T17:02:57Z) - BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics. We use procedural generation of synthetic images to obtain control over visual attributes. This diagnostic allows systematic stress testing and fine-grained failure analysis.
arXiv Detail & Related papers (2025-06-05T12:43:10Z) - What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models [88.398085358514]
DICE is a model designed to detect localized differences between the original and the edited image. It is trained using a strategy that leverages self-supervision, distillation from inpainting networks, and full supervision. We demonstrate that DICE effectively identifies coherent edits and evaluates images generated by different editing models with a strong correlation to human judgment.
arXiv Detail & Related papers (2025-05-26T18:00:10Z) - Painting with Words: Elevating Detailed Image Captioning with Benchmark and Alignment Learning [56.31096024472269]
We introduce DeCapBench along with a novel metric, DCScore, specifically designed for detailed captioning tasks. DCScore evaluates hallucinations and fine-grained comprehensiveness by deconstructing responses into the smallest self-sufficient units. DeCapBench exhibits a high correlation with VLM arena results on descriptive tasks, surpassing existing benchmarks for vision-language models.
arXiv Detail & Related papers (2025-03-10T22:53:56Z) - HMGIE: Hierarchical and Multi-Grained Inconsistency Evaluation for Vision-Language Data Cleansing [54.970275599061594]
We design an adaptive evaluation framework called Hierarchical and Multi-Grained Inconsistency Evaluation (HMGIE). HMGIE can provide multi-grained evaluations covering both accuracy and completeness for various image-caption pairs. To verify the efficacy and flexibility of the proposed framework, we construct MVTID, an image-caption dataset with diverse types and granularities of inconsistencies.
arXiv Detail & Related papers (2024-12-07T15:47:49Z) - Analyzing Local Representations of Self-supervised Vision Transformers [34.56680159632432]
We present a comparative analysis of various self-supervised Vision Transformers (ViTs)
Inspired by large language models, we examine the abilities of ViTs to perform various computer vision tasks with little to no fine-tuning.
arXiv Detail & Related papers (2023-12-31T11:38:50Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - TISE: A Toolbox for Text-to-Image Synthesis Evaluation [9.092600296992925]
We conduct a study on state-of-the-art methods for single- and multi-object text-to-image synthesis.
We propose a common framework for evaluating these methods.
arXiv Detail & Related papers (2021-12-02T16:39:35Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture (a minimal sketch of the matching loss follows this entry).
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
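For the "set-based global loss" named in the DETR entry above, here is a minimal numpy/scipy sketch of Hungarian (bipartite) matching between predictions and ground-truth objects, followed by a loss on the matched pairs. It is a simplified illustration under stated assumptions, not DETR's actual loss: it omits the no-object class and the GIoU box term, and the cost weights are arbitrary.

```python
# Simplified sketch of a set-based detection loss with bipartite (Hungarian) matching.
# Not DETR's actual loss: no no-object class, no GIoU term, unweighted cost terms.
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_prediction_loss(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """pred_probs: (N, C) class probabilities; pred_boxes: (N, 4);
    gt_labels: (M,) integer class ids; gt_boxes: (M, 4). Returns a scalar loss."""
    # Pairwise matching cost between every prediction and every ground-truth object:
    # low cost when the predicted class probability is high and the boxes are close.
    cls_cost = -pred_probs[:, gt_labels]                                      # (N, M)
    box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (N, M)
    cost = cls_cost + box_cost

    # Hungarian matching: each ground-truth object gets exactly one prediction.
    pred_idx, gt_idx = linear_sum_assignment(cost)

    # Loss on the matched pairs: classification NLL plus L1 box regression.
    nll = -np.log(pred_probs[pred_idx, gt_labels[gt_idx]] + 1e-8).mean()
    l1 = np.abs(pred_boxes[pred_idx] - gt_boxes[gt_idx]).sum(-1).mean()
    return nll + l1

# Toy example: 3 predictions over 2 classes, 2 ground-truth objects.
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
boxes = np.array([[0.1, 0.1, 0.3, 0.3], [0.6, 0.6, 0.8, 0.8], [0.4, 0.4, 0.5, 0.5]])
print(set_prediction_loss(probs, boxes, np.array([0, 1]), boxes[:2]))
```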