Related papers: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

URL: http://arxiv.org/abs/2505.16707v1
Date: Thu, 22 May 2025 14:08:59 GMT
Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
Authors: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
Abstract summary: We introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. We categorize editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. To support fine-grained evaluation, we propose a protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies.
Score: 88.58758610679762
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

Related papers

Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing [18.12933371693374]
This paper introduces the concept of "superficial editing" to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms.
arXiv Detail & Related papers (2025-05-19T02:44:57Z)
Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI. Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z)
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models. I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions. We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)
Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
The AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines. We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks. Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark [53.091690659399234]
Knowledge editing on large language models (LLMs) has received considerable attention. The existing LVLM editing benchmark, which comprises three metrics (Reliability, Locality, and Generality), falls short in the quality of synthesized evaluation images. We employ more reliable data collection methods to construct a new Large Vision-Language Model.
arXiv Detail & Related papers (2024-03-12T06:16:33Z)
Recursive Counterfactual Deconfounding for Object Recognition [20.128093193861165]
We propose a Recursive Counterfactual Deconfounding (RCD) model for object recognition in both closed-set and open-set scenarios. We show that the proposed RCD model significantly outperforms 11 state-of-the-art baselines in most cases.
arXiv Detail & Related papers (2023-09-25T07:46:41Z)
GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA). We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli. Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z)

This list is automatically generated from the titles and abstracts of the papers on this site.