IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment
- URL: http://arxiv.org/abs/2511.18055v1
- Date: Sat, 22 Nov 2025 13:16:58 GMT
- Title: IE-Critic-R1: Advancing the Explanatory Measurement of Text-Driven Image Editing for Human Perception Alignment
- Authors: Bowen Qu, Shangkun Sun, Xiaoyu Liang, Wei Gao
- Abstract summary: We introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database of diverse source images, various editing prompts, and the corresponding edited results from different editing methods. IE-Critic-R1 provides more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception.
- Score: 14.001770505266116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-driven image editing have been significant, yet accurately evaluating the edited images remains a considerable challenge. Unlike the assessment of text-driven image generation, text-driven image editing is conditioned simultaneously on both a text prompt and a source image. The edited images often retain an intrinsic connection to the original image, which changes dynamically with the semantics of the text. However, previous methods tend to focus solely on text-image alignment or are not well aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding edited results from different editing methods, with nearly 4,000 samples and corresponding Mean Opinion Scores (MOS) provided by 15 human subjects. Furthermore, we introduce IE-Critic-R1, which, benefiting from Reinforcement Learning from Verifiable Rewards (RLVR), provides a more comprehensive and explainable quality assessment for text-driven image editing that aligns with human perception. Extensive experiments demonstrate IE-Critic-R1's superior alignment with human judgments on the text-driven image editing task compared with previous metrics. Related data and code are publicly available.
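As an illustrative aside (not from the paper): alignment between a quality metric and MOS is conventionally scored with Pearson (PLCC) and Spearman (SRCC) correlation. The sketch below shows that computation on made-up numbers; the specific MOS values and metric predictions are hypothetical.

```python
# Illustrative sketch: scoring a quality metric against Mean Opinion Scores
# (MOS) with PLCC (Pearson) and SRCC (Spearman). All data below are made up.
from statistics import mean

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def rank(v):
    """Ranks of the values in v (ties not handled; fine for illustration)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(x, y):
    """Spearman rank correlation (SRCC) = Pearson on the ranks."""
    return pearson(rank(x), rank(y))

# Hypothetical MOS (averaged human ratings) and metric predictions.
mos = [4.2, 3.1, 2.5, 4.8, 1.9]
pred = [0.81, 0.62, 0.55, 0.90, 0.40]
print(f"PLCC={pearson(mos, pred):.3f}, SRCC={spearman(mos, pred):.3f}")
```

Here the predictions preserve the MOS ordering exactly, so SRCC is 1.0 while PLCC measures how linear the relationship is.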
Related papers
- VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents [45.37806172594631]
Multimodal image editing models have achieved substantial progress, enabling users to manipulate visual content in a flexible and interactive manner. Visual document image editing involves modifying textual content within images while preserving the original text style and background context. Existing approaches, including AnyText, GlyphControl, and TextCtrl, predominantly focus on English-language scenarios and documents with relatively sparse textual layouts.
arXiv Detail & Related papers (2026-01-27T16:51:05Z) - TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering [18.337757379089037]
We introduce TextEditBench, a comprehensive evaluation benchmark for editing text-centric regions in images. Our benchmark emphasizes reasoning-intensive editing scenarios that require models to understand physical plausibility, linguistic meaning, and cross-modal dependencies. We also propose a novel evaluation dimension, Semantic Expectation, which measures a model's ability to maintain semantic consistency, contextual coherence, and cross-modal alignment.
arXiv Detail & Related papers (2025-12-18T07:37:08Z) - GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models. The benchmark includes over 1,000 high-quality editing examples across 20 diverse content categories. We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z) - DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance. To fully compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z) - TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale, text- and prompt-complexity-driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z) - IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment [6.627422081288281]
We introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database containing diverse source images, various editing prompts, and the corresponding results. We also introduce IE-QA, a multi-modality source-aware quality assessment method for text-driven image editing.
arXiv Detail & Related papers (2025-01-17T02:47:25Z) - Visual question answering based evaluation metrics for text-to-image generation [7.105786967332924]
This paper proposes new evaluation metrics that assess the alignment between input text and generated images for every individual object.
Experimental results show that our proposed evaluation approach is superior, simultaneously assessing finer-grained text-image alignment and image quality.
arXiv Detail & Related papers (2024-11-15T13:32:23Z) - Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing [26.086806549826058]
Text-guided image editing seeks to preserve the core elements of the source image while implementing modifications based on the target text. Existing metrics have a context-blindness problem, indiscriminately applying the same evaluation criteria to completely different pairs of source image and target text. We propose AugCLIP, a context-aware metric that adaptively coordinates preservation and modification aspects.
arXiv Detail & Related papers (2024-10-15T08:12:54Z) - Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing [49.419619882284906]
Ground-A-Score is a powerful model-agnostic image editing method that incorporates grounding during score distillation.
Selective application with a new penalty coefficient and a contrastive loss helps to precisely target editing areas.
Both qualitative assessments and quantitative analyses confirm that Ground-A-Score successfully adheres to the intricate details of extended and multifaceted prompts.
arXiv Detail & Related papers (2024-03-20T12:40:32Z) - Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built by fine-tuning on text-guided image inpainting.
Edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training.
To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z) - Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy.
We propose a novel text-to-image method that addresses these gaps by enabling a simple control mechanism complementary to text in the form of a scene.
Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.