KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
- URL: http://arxiv.org/abs/2505.16707v1
- Date: Thu, 22 May 2025 14:08:59 GMT
- Title: KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models
- Authors: Yongliang Wu, Zonghui Li, Xinting Hu, Xinyu Ye, Xianfang Zeng, Gang Yu, Wenbo Zhu, Bernt Schiele, Ming-Hsuan Yang, Xu Yang
- Abstract summary: We introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. We categorize editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. To support fine-grained evaluation, we propose a protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies.
- Score: 88.58758610679762
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on 10 state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.
Related papers
- How Well Do Models Follow Visual Instructions? VIBE: A Systematic Benchmark for Visual Instruction-Driven Image Editing [56.60465182650588]
We introduce a three-level interaction hierarchy that captures deictic grounding, morphological manipulation, and causal reasoning. We propose a robust LMM-as-a-judge evaluation framework with task-specific metrics to enable scalable and fine-grained assessment. We find that proprietary models exhibit early-stage visual instruction-following capabilities and consistently outperform open-source models.
arXiv Detail & Related papers (2026-02-02T09:24:45Z)
- I2I-Bench: A Comprehensive Benchmark Suite for Image-to-Image Editing Models [78.62380562116135]
Existing image editing benchmarks suffer from limited task scopes, insufficient evaluation dimensions, and heavy reliance on manual annotations. We propose I2I-Bench, a comprehensive benchmark for image-to-image editing models, which features 10 task categories across both single-image and multi-image editing tasks. Using I2I-Bench, we benchmark numerous mainstream image editing models, investigating the gaps and trade-offs between editing models across various dimensions.
arXiv Detail & Related papers (2025-12-04T10:44:07Z)
- WiseEdit: Benchmarking Cognition- and Creativity-Informed Image Editing [39.431195153927334]
WiseEdit is a knowledge-intensive benchmark for comprehensive evaluation of cognition- and creativity-informed image editing. WiseEdit decomposes image editing into three cascaded steps, each corresponding to a task that poses a challenge for models to complete. Ultimately, WiseEdit comprises 1,220 test cases, objectively revealing the limitations of SoTA image editing models.
arXiv Detail & Related papers (2025-11-29T08:32:35Z)
- Q-REAL: Towards Realism and Plausibility Evaluation for AI-Generated Content [71.46991494014382]
We introduce Q-Real, a novel dataset for fine-grained evaluation of realism and plausibility in AI-generated images. Q-Real consists of 3,088 images generated by popular text-to-image models. We construct Q-Real Bench to evaluate them on two tasks: judgment and grounding with reasoning.
arXiv Detail & Related papers (2025-11-21T02:43:17Z)
- Factuality Matters: When Image Generation and Editing Meet Structured Visuals [46.627460447235855]
We construct a large-scale dataset of 1.3 million high-quality structured image pairs. We train a unified model that integrates a VLM with FLUX.1 Kontext. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation.
arXiv Detail & Related papers (2025-10-06T17:56:55Z)
- Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing [18.12933371693374]
This paper introduces the concept of "superficial editing" to describe this phenomenon. Our comprehensive evaluation reveals that this issue presents a significant challenge to existing algorithms.
arXiv Detail & Related papers (2025-05-19T02:44:57Z)
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- PixLens: A Novel Framework for Disentangled Evaluation in Diffusion-Based Image Editing with Object Detection + SAM [17.89238060470998]
Evaluating diffusion-based image-editing models is a crucial task in the field of Generative AI.
Our benchmark, PixLens, provides a comprehensive evaluation of both edit quality and latent representation disentanglement.
arXiv Detail & Related papers (2024-10-08T06:05:15Z)
- I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing [67.05794909694649]
We propose I2EBench, a comprehensive benchmark to evaluate the quality of edited images produced by IIE models.
I2EBench consists of 2,000+ images for editing, along with 4,000+ corresponding original and diverse instructions.
We will open-source I2EBench, including all instructions, input images, human annotations, edited images from all evaluated methods, and a simple script for evaluating the results from new IIE models.
arXiv Detail & Related papers (2024-08-26T11:08:44Z)
- Learning Action and Reasoning-Centric Image Editing from Videos and Simulations [45.637947364341436]
AURORA dataset is a collection of high-quality training data, human-annotated and curated from videos and simulation engines.
We evaluate an AURORA-finetuned model on a new expert-curated benchmark covering 8 diverse editing tasks.
Our model significantly outperforms previous editing models as judged by human raters.
arXiv Detail & Related papers (2024-07-03T19:36:33Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark [53.091690659399234]
Knowledge editing on large language models (LLMs) has received considerable attention.
The existing LVLM editing benchmark, which comprises three metrics (Reliability, Locality, and Generality), falls short in the quality of synthesized evaluation images.
We employ more reliable data collection methods to construct a new Large Vision-Language Model Knowledge Editing Benchmark (VLKEB).
arXiv Detail & Related papers (2024-03-12T06:16:33Z)
- Recursive Counterfactual Deconfounding for Object Recognition [20.128093193861165]
We propose a Recursive Counterfactual Deconfounding model for object recognition in both closed-set and open-set scenarios.
We show that the proposed RCD model significantly outperforms 11 state-of-the-art baselines in most cases.
arXiv Detail & Related papers (2023-09-25T07:46:41Z)
- GLUECons: A Generic Benchmark for Learning Under Constraints [102.78051169725455]
In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision.
We model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints.
arXiv Detail & Related papers (2023-02-16T16:45:36Z)
- Image Quality Assessment in the Modern Age [53.19271326110551]
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA).
We will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli.
Both hand-engineered and (deep) learning-based methods will be covered.
arXiv Detail & Related papers (2021-10-19T02:38:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.