Mapping the Mind of an Instruction-based Image Editing using SMILE
- URL: http://arxiv.org/abs/2412.16277v1
- Date: Fri, 20 Dec 2024 18:33:23 GMT
- Title: Mapping the Mind of an Instruction-based Image Editing using SMILE
- Authors: Zeinab Dehghani, Koorosh Aslansefat, Adil Khan, Adín Ramírez Rivera, Franky George, Muhammad Khalid
- Abstract summary: We introduce SMILE (Statistical Model-agnostic Interpretability with Local Explanations), a novel model-agnostic method for localized interpretability. We show how our method can improve interpretability and reliability. These findings indicate the exciting potential of model-agnostic interpretability for reliability and trustworthiness in critical applications.
- Score: 8.773288793688998
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite recent advances in Instruction-based Image Editing models that generate high-quality images, these models remain black boxes, which poses a significant barrier to transparency and user trust. To address this issue, we introduce SMILE (Statistical Model-agnostic Interpretability with Local Explanations), a novel model-agnostic method for localized interpretability that provides a visual heatmap clarifying how textual elements influence image-generating models. We applied our method to several Instruction-based Image Editing models, including Pix2Pix, Image2Image-turbo, and Diffusers-Inpaint, and showed how it improves interpretability and reliability. We also evaluate our method using stability, accuracy, fidelity, and consistency metrics. These findings indicate the promising potential of model-agnostic interpretability for reliability and trustworthiness in critical applications such as healthcare and autonomous driving, while encouraging further investigation into the role of interpretability in building dependable image editing models.
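The abstract does not include code, but the idea lends itself to a short sketch. Below is a minimal perturbation-based attribution loop in the spirit of SMILE: instruction words are randomly dropped, the editor is re-run, and a weighted linear surrogate scores each word's influence (the values behind the heatmap). The `editor` and `similarity` hooks are hypothetical stand-ins, not the paper's API, and the statistical-distance weighting that SMILE is reported to use in place of LIME's kernel is only gestured at here.

```python
import numpy as np

def smile_style_attribution(editor, similarity, image, instruction,
                            n_samples=200, seed=0):
    """LIME/SMILE-style attribution of instruction words for an image editor.

    `editor(image, text) -> edited_image` and `similarity(a, b) -> float`
    are hypothetical hooks for a model such as Pix2Pix or Diffusers-Inpaint.
    Returns a word -> influence-score mapping (heatmap values).
    """
    rng = np.random.default_rng(seed)
    words = instruction.split()
    reference = editor(image, instruction)

    # Randomly drop words and record how much the edited output changes.
    masks = rng.integers(0, 2, size=(n_samples, len(words)))
    masks[0] = 1  # keep the unperturbed instruction in the sample set
    outputs = np.empty(n_samples)
    for i, mask in enumerate(masks):
        kept = " ".join(w for w, keep in zip(words, mask) if keep)
        outputs[i] = similarity(reference, editor(image, kept))

    # SMILE is reported to replace LIME's kernel weights with weights based
    # on a statistical distance; a simple distance-decay weight stands in.
    frac_removed = 1.0 - masks.mean(axis=1)
    sqrt_w = np.sqrt(np.exp(-frac_removed))

    # Weighted least squares: surrogate coefficients score each word.
    X = np.hstack([masks, np.ones((n_samples, 1))]) * sqrt_w[:, None]
    coef, *_ = np.linalg.lstsq(X, outputs * sqrt_w, rcond=None)
    return dict(zip(words, coef[:-1]))
```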
Related papers
- A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach that harnesses image generation models to perform targeted perturbations.
Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity.
This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
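A rough sketch of this metric, assuming hypothetical `classifier`, `saliency_map`, and `inpaint` placeholders rather than the paper's actual interfaces:

```python
import numpy as np

def perturbation_faithfulness(image, classifier, saliency_map, inpaint,
                              top_frac=0.1):
    """Inpaint the most relevant pixels and measure the prediction drop;
    a larger drop suggests a more faithful explanation. All callables
    here are hypothetical placeholders."""
    probs = classifier(image)
    label = int(np.argmax(probs))

    # Select the top-k% most relevant pixels according to the explanation.
    k = max(1, int(top_frac * saliency_map.size))
    thresh = np.partition(saliency_map.ravel(), -k)[-k]
    mask = saliency_map >= thresh

    # Replace them with in-distribution content from a generative inpainter,
    # rather than gray/black patches that push the image off-distribution.
    perturbed = inpaint(image, mask)
    return probs[label] - classifier(perturbed)[label]
```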
arXiv Detail & Related papers (2025-04-09T11:46:41Z)
- Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging [0.768721532845575]
Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text.
In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?'
arXiv Detail & Related papers (2025-03-30T22:49:26Z)
- DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation [0.13124513975412253]
We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models.
Our approach begins by translating images into detailed textual descriptions using a captioning model.
These descriptions are then used to produce new test images through a text-to-image diffusion process.
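As a rough illustration of this caption-then-regenerate pipeline (not the paper's code; the model names are illustrative stand-ins, and the paper's control-conditioned diffusion step is simplified to plain text-to-image):

```python
import torch
from transformers import pipeline
from diffusers import StableDiffusionPipeline

# Illustrative model choices, not those used in the paper.
captioner = pipeline("image-to-text",
                     model="Salesforce/blip-image-captioning-large")
generator = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def augment(image, variation="at night, in heavy rain"):
    # 1) Translate the image into a textual description.
    caption = captioner(image)[0]["generated_text"]
    # 2) Edit the description to cover a new test condition.
    prompt = f"{caption}, {variation}"
    # 3) Synthesize a new test image from the edited description.
    return generator(prompt).images[0]
```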
arXiv Detail & Related papers (2025-02-05T16:35:42Z)
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [49.60774626839712]
Multimodal generative models have sparked critical discussions on their fairness, reliability, and potential for misuse.
We propose an evaluation framework designed to assess model reliability through their responses to perturbations in the embedding space.
Our method lays the groundwork for detecting unreliable, bias-injected models and retrieving bias provenance.
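A toy sketch of such an embedding-space reliability probe; the `generate_from_embedding` and `distance` hooks and the noise scale are assumptions, not the paper's framework:

```python
import numpy as np

def reliability_under_perturbation(generate_from_embedding, distance,
                                   text_embedding, sigma=0.05,
                                   n_trials=20, seed=0):
    """Perturb the prompt embedding with small Gaussian noise and measure
    how much the generated outputs drift; unstable responses flag an
    unreliable model. Both callables are hypothetical placeholders."""
    rng = np.random.default_rng(seed)
    reference = generate_from_embedding(text_embedding)
    drifts = []
    for _ in range(n_trials):
        noisy = text_embedding + sigma * rng.standard_normal(text_embedding.shape)
        drifts.append(distance(reference, generate_from_embedding(noisy)))
    return float(np.mean(drifts))  # higher drift => less reliable model
```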
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models.
We use GPT4V to bridge the gap between the reference image and the text input for the T2I model.
We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
- Fine Tuning Text-to-Image Diffusion Models for Correcting Anomalous Images [0.0]
This study proposes a method to mitigate such issues by fine-tuning the Stable Diffusion 3 model using the DreamBooth technique.
Experimental results targeting the prompt "lying on the grass/street" demonstrate that the fine-tuned model shows improved performance in visual evaluation and in metrics such as the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Fréchet Inception Distance (FID).
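For reference, SSIM and PSNR can be computed with scikit-image; FID is distributional, requiring many samples and an Inception feature extractor (e.g. torchmetrics' FrechetInceptionDistance), so it is omitted from this sketch:

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def image_quality(reference: np.ndarray, generated: np.ndarray) -> dict:
    """SSIM and PSNR between a reference and a generated image,
    given as same-shaped uint8 RGB arrays."""
    return {
        "ssim": structural_similarity(reference, generated, channel_axis=-1),
        "psnr": peak_signal_noise_ratio(reference, generated),
    }
```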
arXiv Detail & Related papers (2024-09-23T00:51:47Z)
- Assessing Graphical Perception of Image Embedding Models using Channel Effectiveness [20.269583912221734]
We introduce a novel evaluation framework to assess the graphical perception of image embedding models.
For chart comprehension, we examine two main aspects of channel effectiveness: accuracy and discriminability of various visual channels.
Experiments with the CLIP model show that it perceives channel accuracy differently from humans and shows unique discriminability in channels like length, tilt, and curvature.
arXiv Detail & Related papers (2024-07-30T14:22:13Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
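A bare-bones PyTorch sketch of the fine-tune-on-counterfactuals step; the model and dataset are placeholders, and the hyperparameters are illustrative rather than the paper's:

```python
import torch
from torch.utils.data import DataLoader

def reinforce_with_counterfactuals(model, counterfactual_dataset,
                                   epochs=3, lr=1e-4, device="cuda"):
    """Fine-tune a classifier on counterfactual images that exposed its
    weaknesses; the dataset yields (image_tensor, label) pairs."""
    model.train().to(device)
    loader = DataLoader(counterfactual_dataset, batch_size=32, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model.eval()
```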
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state of the art, yet requires nothing but a pre-trained denoising network to estimate mutual information (MI).
arXiv Detail & Related papers (2024-05-31T12:20:02Z)
- Diffexplainer: Towards Cross-modal Global Explanations with Diffusion Models [51.21351775178525]
DiffExplainer is a novel framework that leverages language-vision models to enable multimodal global explainability.
It employs diffusion models conditioned on optimized text prompts, synthesizing images that maximize class outputs.
The analysis of generated visual descriptions allows for automatic identification of biases and spurious features.
arXiv Detail & Related papers (2024-04-03T10:11:22Z)
- Counterfactual Image Editing [54.21104691749547]
Counterfactual image editing is an important task in generative AI that asks how an image would look if certain features were different.
We formalize the counterfactual image editing task using formal language, modeling the causal relationships between latent generative factors and images.
We develop an efficient algorithm to generate counterfactual images by leveraging neural causal models.
arXiv Detail & Related papers (2024-02-07T20:55:39Z)
- Prompt-Propose-Verify: A Reliable Hand-Object-Interaction Data Generation Framework using Foundational Models [0.0]
Diffusion models, when conditioned on text prompts, generate realistic-looking images with intricate details.
However, most of these pre-trained models fail to generate accurate images of human features such as hands and teeth.
arXiv Detail & Related papers (2023-12-23T12:59:22Z)
- WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models [32.29120988096214]
This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images.
Our method modifies generative models based on each user's unique digital fingerprint, imprinting a unique identifier onto the resultant content that can be traced back to the user.
arXiv Detail & Related papers (2023-06-07T19:44:14Z)