Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers
- URL: http://arxiv.org/abs/2406.11534v1
- Date: Mon, 17 Jun 2024 13:37:35 GMT
- Title: Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers
- Authors: Lokesh Badisa, Sumohana S. Channappayya
- Abstract summary: In this work, we propose a novel evaluation framework called Inpainting the Gaps (InG).
InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT).
To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
- Score: 10.97134072427802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The perturbation test remains the go-to evaluation approach for explanation methods in computer vision. However, it has a major drawback: pixel masking introduces a test-time distribution shift that is not present in the training set. To overcome this drawback, we propose a novel evaluation framework called Inpainting the Gaps (InG). Specifically, we propose inpainting parts that constitute partial or complete objects in an image. In this way, one can perform meaningful image perturbations with lower test-time distribution shift, thereby improving the efficacy of the perturbation test. InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT). Based on this evaluation, we found Beyond Intuition and Generic Attribution to be the two most consistent explanation methods. Further, and interestingly, the proposed framework results in higher and more consistent evaluation scores across all the ViT models considered in this work. To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
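Since the abstract stays at the level of the idea, the following is a minimal sketch of how such an inpainting-based perturbation test could be scored. OpenCV's classical Telea inpainting stands in for whichever inpainting model InG actually uses, and `model`, `preprocess`, and `part_mask` are illustrative placeholders rather than names from the paper.

```python
import cv2
import numpy as np
import torch

def inpaint_part(image_bgr: np.ndarray, part_mask: np.ndarray) -> np.ndarray:
    """Remove an object part by inpainting rather than pixel-masking,
    keeping the perturbed image close to the natural-image manifold."""
    mask_u8 = (part_mask > 0).astype(np.uint8) * 255
    return cv2.inpaint(image_bgr, mask_u8, inpaintRadius=3, flags=cv2.INPAINT_TELEA)

@torch.no_grad()
def prediction_drop(model, preprocess, image_bgr, part_mask, target_class):
    """Drop in target-class probability after inpainting the flagged part.
    A faithful explanation flags parts whose removal causes a large drop."""
    perturbed = inpaint_part(image_bgr, part_mask)
    batch = torch.stack([preprocess(image_bgr), preprocess(perturbed)])
    probs = model(batch).softmax(dim=-1)[:, target_class]
    return (probs[0] - probs[1]).item()
```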
Related papers
- Hierarchical Scoring with 3D Gaussian Splatting for Instance Image-Goal Navigation [27.040017548286812]
Instance Image-Goal Navigation (IIN) requires autonomous agents to identify and navigate to a target object or location depicted in a reference image captured from any viewpoint.
We introduce a novel IIN framework with a hierarchical scoring paradigm that estimates optimal viewpoints for target matching.
arXiv Detail & Related papers (2025-06-09T00:58:14Z)
- From Visual Explanations to Counterfactual Explanations with Latent Diffusion [11.433402357922414]
We propose a new approach to tackle two key challenges in recent prominent works.
First, we determine which specific counterfactual features are crucial for distinguishing the "concept" of the target class from the original class.
Second, we provide valuable explanations for the non-robust classifier without relying on the support of an adversarially robust model.
arXiv Detail & Related papers (2025-04-12T13:04:00Z)
- A Meaningful Perturbation Metric for Evaluating Explainability Methods [55.09730499143998]
We introduce a novel approach, which harnesses image generation models to perform targeted perturbation.
Specifically, we focus on inpainting only the high-relevance pixels of an input image to modify the model's predictions while preserving image fidelity.
This is in contrast to existing approaches, which often produce out-of-distribution modifications, leading to unreliable results.
arXiv Detail & Related papers (2025-04-09T11:46:41Z)
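The high-relevance selection step described in the entry above could look like the following sketch; the `keep_fraction` threshold is an illustrative choice rather than the paper's.

```python
import numpy as np

def top_relevance_mask(relevance: np.ndarray, keep_fraction: float = 0.1) -> np.ndarray:
    """Binary mask over the top `keep_fraction` most relevant pixels,
    usable as the inpainting region for a targeted perturbation."""
    threshold = np.quantile(relevance, 1.0 - keep_fraction)
    return (relevance >= threshold).astype(np.uint8)
```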
- Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step [77.86514804787622]
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks.
We provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation.
We propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation.
arXiv Detail & Related papers (2025-01-23T18:59:43Z)
- Grounding Descriptions in Images informs Zero-Shot Visual Recognition [47.66166611138081]
We propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously.
We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the art methods.
arXiv Detail & Related papers (2024-12-05T18:52:00Z)
- Mismatched: Evaluating the Limits of Image Matching Approaches and Benchmarks [9.388897214344572]
Three-dimensional (3D) reconstruction from two-dimensional images is an active research field in computer vision.
Traditionally, parametric techniques have been employed for this task.
Recent advancements have seen a shift towards learning-based methods.
arXiv Detail & Related papers (2024-08-29T11:16:34Z)
- Confidence-aware Reward Optimization for Fine-tuning Text-to-Image Models [85.96013373385057]
Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent.
However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models.
We propose TextNorm, a method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts.
arXiv Detail & Related papers (2024-04-02T11:40:38Z)
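Reading the TextNorm entry above, one plausible (but assumed, not confirmed) instantiation of confidence estimation over contrastive prompts is a softmax over the rewards the same image earns under each prompt:

```python
import numpy as np

def confidence_weighted_reward(reward_fn, image, prompt, contrastive_prompts, temperature=1.0):
    """Illustrative sketch, not the paper's exact formula: the reward for the
    intended prompt is down-weighted when semantically contrastive prompts
    score almost as highly, i.e. when the reward model is unconfident."""
    rewards = np.array([reward_fn(image, p) for p in [prompt, *contrastive_prompts]])
    confidence = np.exp(rewards / temperature)
    confidence /= confidence.sum()            # softmax over the prompt set
    return float(rewards[0] * confidence[0])  # calibrated proxy reward
```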
- Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection [106.39544368711427]
We study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods.
We present a novel forgery-aware adaptive transformer approach, namely FatFormer.
Our approach, tuned on 4-class ProGAN data, attains an average of 98% accuracy on unseen GANs and, surprisingly, generalizes to unseen diffusion models with 95% accuracy.
arXiv Detail & Related papers (2023-12-27T17:36:32Z)
- HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models [56.112302700630806]
We introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation.
Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations.
We extend our method to a novel image editing task: substituting the subject in an image through textual manipulations.
arXiv Detail & Related papers (2023-11-30T02:33:29Z)
- Benchmark Generation Framework with Customizable Distortions for Image Classifier Robustness [4.339574774938128]
We present a novel framework for generating adversarial benchmarks to evaluate the robustness of image classification models.
Our framework allows users to customize the types of distortions to be optimally applied to images, which helps address the specific distortions relevant to their deployment.
arXiv Detail & Related papers (2023-10-28T07:40:42Z)
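A benchmark of this kind might be instantiated as in the sketch below, with a plain severity sweep standing in for the paper's optimized distortions; the noise model and severity grid are illustrative.

```python
import numpy as np

def gaussian_noise(image: np.ndarray, severity: float) -> np.ndarray:
    """One customizable distortion: additive Gaussian noise in [0, 255] space."""
    noisy = image.astype(np.float64) + np.random.normal(0.0, severity * 255.0, image.shape)
    return np.clip(noisy, 0, 255).astype(image.dtype)

def robustness_curve(classify, images, labels, distortion, severities=(0.02, 0.05, 0.1)):
    """Accuracy of `classify` at each severity level of a chosen distortion."""
    return [
        float(np.mean([classify(distortion(img, s)) == y for img, y in zip(images, labels)]))
        for s in severities
    ]
```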
- A Fine-Grained Image Description Generation Method Based on Joint Objectives [7.565093400979752]
We propose an innovative Fine-grained Image Description Generation model based on Joint Objectives.
We introduce new object-based evaluation metrics to more intuitively assess the model's performance in handling description repetition and omission.
Experimental results demonstrate that our proposed method significantly improves the CIDEr evaluation metric.
arXiv Detail & Related papers (2023-09-02T03:22:39Z)
- Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks [2.160196691362033]
We present a new self-supervised pre-training of Vision Transformers for dense prediction tasks.
Our strategy produces better local features suitable for dense prediction tasks as opposed to contrastive pre-training based on global image representation only.
arXiv Detail & Related papers (2022-05-30T15:25:37Z)
- Semantic keypoint-based pose estimation from single RGB frames [64.80395521735463]
We present an approach to estimating the continuous 6-DoF pose of an object from a single RGB image.
The approach combines semantic keypoints predicted by a convolutional network (convnet) with a deformable shape model.
We show that our approach can accurately recover the 6-DoF object pose for both instance- and class-based scenarios.
arXiv Detail & Related papers (2022-04-12T15:03:51Z)
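The rigid core of such a keypoint-based pipeline can be sketched with OpenCV's PnP solver; the fixed 3D keypoint template here is a simplification of the paper's deformable shape model.

```python
import cv2
import numpy as np

def pose_from_keypoints(model_points_3d, image_points_2d, camera_matrix):
    """6-DoF pose from convnet-predicted 2D keypoints via PnP."""
    ok, rvec, tvec = cv2.solvePnP(
        model_points_3d.astype(np.float64),  # (N, 3) canonical 3D keypoints
        image_points_2d.astype(np.float64),  # (N, 2) detected 2D keypoints
        camera_matrix,                       # 3x3 camera intrinsics
        distCoeffs=None,
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    rotation, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 matrix
    return ok, rotation, tvec
```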
- Palette: Image-to-Image Diffusion Models [50.268441533631176]
We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models.
On four challenging image-to-image translation tasks, Palette outperforms strong GAN and regression baselines.
We report several sample quality scores including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images.
arXiv Detail & Related papers (2021-11-10T17:49:29Z)
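Of the sample-quality scores listed above, Classification Accuracy is straightforward to reproduce with torchvision; the exact protocol below (weights version, preprocessing) is an assumption rather than the paper's setup.

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

@torch.no_grad()
def classification_accuracy(generated_images: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of generated images a frozen ImageNet ResNet-50 assigns
    to their reference class."""
    weights = ResNet50_Weights.IMAGENET1K_V2
    model = resnet50(weights=weights).eval()
    inputs = weights.transforms()(generated_images)  # resize, crop, normalize
    preds = model(inputs).argmax(dim=-1)
    return (preds == labels).float().mean().item()
```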
- Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps.
We propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches.
arXiv Detail & Related papers (2021-04-20T21:34:24Z)
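The weighted-average construction behind vanilla CAM is compact enough to state directly; variable shapes below are generic rather than taken from the paper.

```python
import numpy as np

def class_activation_map(activations: np.ndarray, class_weights: np.ndarray) -> np.ndarray:
    """Vanilla CAM: a class-weighted average of the final conv feature maps.
    activations: (K, H, W) feature maps; class_weights: (K,) classifier weights."""
    cam = np.tensordot(class_weights, activations, axes=1)  # sum_k w_k * A_k
    cam = np.maximum(cam, 0.0)                              # keep positive evidence
    return cam / (cam.max() + 1e-8)                         # normalize to [0, 1]
```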