Related papers: IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment

IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment

URL: http://arxiv.org/abs/2501.09927v1
Date: Fri, 17 Jan 2025 02:47:25 GMT
Title: IE-Bench: Advancing the Measurement of Text-Driven Image Editing for Human Perception Alignment
Authors: Shangkun Sun, Bowen Qu, Xiaoyu Liang, Songlin Fan, Wei Gao,
Abstract summary: We introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images.<n>IE-Bench includes a database containing diverse source images, various editing prompts and the corresponding results.<n>We also introduce IE-QA, a multi-modality source-aware quality assessment method for text-driven image editing.
Score: 6.627422081288281
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in text-driven image editing have been significant, yet the task of accurately evaluating these edited images continues to pose a considerable challenge. Different from the assessment of text-driven image generation, text-driven image editing is characterized by simultaneously conditioning on both text and a source image. The edited images often retain an intrinsic connection to the original image, which dynamically change with the semantics of the text. However, previous methods tend to solely focus on text-image alignment or have not aligned with human perception. In this work, we introduce the Text-driven Image Editing Benchmark suite (IE-Bench) to enhance the assessment of text-driven edited images. IE-Bench includes a database contains diverse source images, various editing prompts and the corresponding results different editing methods, and total 3,010 Mean Opinion Scores (MOS) provided by 25 human subjects. Furthermore, we introduce IE-QA, a multi-modality source-aware quality assessment method for text-driven image editing. To the best of our knowledge, IE-Bench offers the first IQA dataset and model tailored for text-driven image editing. Extensive experiments demonstrate IE-QA's superior subjective-alignments on the text-driven image editing task compared with previous metrics. We will make all related data and code available to the public.

Related papers

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits [22.762414256693265]
We introduce EditInspector, a novel benchmark for evaluation of text-guided image edits.<n>We leverage EditInspector to evaluate the performance of state-of-the-art (SoTA) vision and language models in assessing edits.<n>Our findings indicate that current models struggle to evaluate edits comprehensively and frequently hallucinate when describing the changes.
arXiv Detail & Related papers (2025-06-11T17:58:25Z)
GIE-Bench: Towards Grounded Evaluation for Text-Guided Image Editing [60.66800567924348]
We introduce a new benchmark designed to evaluate text-guided image editing models.<n>The benchmark includes over 1000 high-quality editing examples across 20 diverse content categories.<n>We conduct a large-scale study comparing GPT-Image-1 against several state-of-the-art editing models.
arXiv Detail & Related papers (2025-05-16T17:55:54Z)
DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models. Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as a regional cues to improve editing performance. To fully compare our methods with other DiT-based approaches, we construct the RW-800 benchmark, featuring high resolution images, long descriptive texts, real-world images, and a new text editing task.
arXiv Detail & Related papers (2025-03-21T02:14:03Z)
TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark [61.412934963260724]
Existing diffusion-based text-to-image models often struggle to accurately embed text within images. We introduce TextInVision, a large-scale, text and prompt complexity driven benchmark to evaluate the ability of diffusion models to integrate visual text into images.
arXiv Detail & Related papers (2025-03-17T21:36:31Z)
Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing [26.086806549826058]
We propose textttAugCLIP, a textbfcontext-aware metric that adaptively coordinates preservation and modification aspects.<n>textttAugCLIP aligns remarkably well with human evaluation standards, outperforming existing metrics.
arXiv Detail & Related papers (2024-10-15T08:12:54Z)
ParallelEdits: Efficient Multi-Aspect Text-Driven Image Editing with Attention Grouping [31.026083872774834]
ParallelEdits is a method that seamlessly manages simultaneous edits across multiple attributes. PIE-Bench++ dataset is a benchmark for evaluating text-driven image editing methods in multifaceted scenarios.
arXiv Detail & Related papers (2024-06-03T04:43:56Z)
DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images [55.546024767130994]
We propose a novel model to enhance the text-based control of an image editor by explicitly reasoning about which parts of the image to alter or preserve. It relies on word alignments between a description of the original source image and the instruction that reflects the needed updates, and the input image. It is evaluated on a subset of the Bison dataset and a self-defined dataset dubbed Dream.
arXiv Detail & Related papers (2024-04-27T22:45:47Z)
E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance [13.535394339438428]
Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. We propose a zero-shot image editing method, named textbfEnhance textbfEditability for text-based image textbfEditing via textbfCLIP guidance.
arXiv Detail & Related papers (2024-03-15T09:26:48Z)
FICE: Text-Conditioned Fashion Image Editing With Guided GAN Inversion [16.583537785874604]
We propose a novel text-conditioned editing model, called FICE, capable of handling a wide variety of diverse text descriptions. FICE generates highly realistic fashion images and leads to stronger editing performance than existing competing approaches.
arXiv Detail & Related papers (2023-01-05T15:33:23Z)
Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting [53.708523312636096]
We present Imagen Editor, a cascaded diffusion model built, by fine-tuning on text-guided image inpainting. edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting.
arXiv Detail & Related papers (2022-12-13T21:25:11Z)
ManiCLIP: Multi-Attribute Face Manipulation from Text [104.30600573306991]
We present a novel multi-attribute face manipulation method based on textual descriptions. Our method generates natural manipulated faces with minimal text-irrelevant attribute editing.
arXiv Detail & Related papers (2022-10-02T07:22:55Z)
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors [58.71128866226768]
Recent text-to-image generation methods have incrementally improved the generated image fidelity and text relevancy. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high fidelity images in a resolution of 512x512 pixels.
arXiv Detail & Related papers (2022-03-24T15:44:50Z)
Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)

This list is automatically generated from the titles and abstracts of the papers in this site.