Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks
- URL: http://arxiv.org/abs/2408.02123v2
- Date: Tue, 20 Aug 2024 11:57:06 GMT
- Title: Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks
- Authors: Mahadev Prasad Panda, Matteo Tiezzi, Martina Vilas, Gemma Roig, Bjoern M. Eskofier, Dario Zanca
- Abstract summary: We introduce Foveation-based Explanations (FovEx), a novel human-inspired visual explainability (XAI) method for Deep Neural Networks.
Our method achieves state-of-the-art performance on both transformer (on 4 out of 5 metrics) and convolutional models, demonstrating its versatility.
- Score: 8.659674736978555
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We introduce Foveation-based Explanations (FovEx), a novel human-inspired visual explainability (XAI) method for Deep Neural Networks. Our method achieves state-of-the-art performance on both transformer (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14% in NSS compared to RISE, +203% in NSS compared to gradCAM), enhancing our confidence in FovEx's ability to close the interpretation gap between humans and machines.
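For reference, the NSS (Normalized Scanpath Saliency) score quoted above measures how well an explanation map aligns with human fixations: the saliency map is z-scored and then averaged at the fixation locations. A minimal sketch of the metric (not the FovEx implementation; the 2-D array shapes are an assumption):

```python
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean of the z-scored saliency
    map at human fixation locations. Higher means better alignment."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float(s[fixations.astype(bool)].mean())
```

A map that peaks exactly where people look scores high; a flat map scores zero, which is what makes NSS a natural yardstick for comparing explanation maps against gaze data.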
Related papers
- 4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration [31.111439909825627]
Existing methods typically model the dataset's action distribution using simple observations as inputs. We propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to address these sources of chaos. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
arXiv Detail & Related papers (2025-06-27T14:09:29Z) - UAVTwin: Neural Digital Twins for UAVs using Gaussian Splatting [57.63613048492219]
We present UAVTwin, a method for creating digital twins from real-world environments and facilitating data augmentation for training downstream models embedded in unmanned aerial vehicles (UAVs).
This is achieved by integrating 3D Gaussian Splatting (3DGS) for reconstructing backgrounds along with controllable synthetic human models that display diverse appearances and actions in multiple poses.
arXiv Detail & Related papers (2025-04-02T22:17:30Z) - Attribution for Enhanced Explanation with Transferable Adversarial eXploration [10.802449518516209]
AttEXplore++ enhances attribution by incorporating transferable adversarial attack methods.
We conduct experiments on five models, including CNNs (Inception-v3, ResNet-50, and VGG16) and vision transformers, using the ImageNet dataset.
Our method achieves an average performance improvement of 7.57% over AttEXplore and 32.62% compared to other state-of-the-art interpretability algorithms.
arXiv Detail & Related papers (2024-12-27T08:27:53Z) - Convolution goes higher-order: a biologically inspired mechanism empowers image classification [0.8999666725996975]
We propose a novel approach to image classification inspired by complex nonlinear biological visual processing. Our model incorporates a Volterra-like expansion of the convolution operator, capturing multiplicative interactions. Our work bridges neuroscience and deep learning, offering a path towards more effective, biologically inspired computer vision models.
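A Volterra-like expansion augments the usual linear convolution with higher-order terms that weight products of inputs within each window, which is what "capturing multiplicative interactions" means concretely. A minimal 1-D, second-order sketch (illustrative only; the paper's actual layer, kernel shapes, and order may differ):

```python
import numpy as np

def volterra_conv1d(x, w1, w2):
    """First- plus second-order (Volterra-like) 1D convolution.

    w1: linear kernel of length k (the ordinary convolution term).
    w2: k x k kernel weighting multiplicative pairs x[i]*x[j]
        inside each window -- the quadratic Volterra term.
    """
    k = len(w1)
    out = np.empty(len(x) - k + 1)
    for t in range(len(out)):
        win = x[t:t + k]
        out[t] = w1 @ win + win @ w2 @ win  # linear + quadratic response
    return out
```

With `w2` set to zero the layer reduces to a plain correlation, so the quadratic term is a strict generalization of the standard convolution operator.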
arXiv Detail & Related papers (2024-12-09T18:33:09Z) - Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z) - FFHFlow: A Flow-based Variational Approach for Learning Diverse Dexterous Grasps with Shape-Aware Introspection [19.308304984645684]
We introduce a novel model that can generate diverse grasps for a multi-fingered hand. The proposed idea gains superior performance and higher run-time efficiency against strong baselines. We also demonstrate substantial benefits of greater diversity for grasping objects in clutter and a confined workspace in the real world.
arXiv Detail & Related papers (2024-07-21T13:33:08Z) - FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection [33.225938984092274]
We propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies.
We also design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds.
We develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features.
arXiv Detail & Related papers (2024-07-14T09:39:44Z) - Opinion-Unaware Blind Image Quality Assessment using Multi-Scale Deep Feature Statistics [54.08757792080732]
We propose integrating deep features from pre-trained visual models with a statistical analysis model to achieve opinion-unaware BIQA (OU-BIQA).
Our proposed model exhibits superior consistency with human visual perception compared to state-of-the-art BIQA models.
arXiv Detail & Related papers (2024-05-29T06:09:34Z) - Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z) - ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework called ViTGaze.
In contrast to previous methods, it creates a novel gaze following framework based mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z) - Manipulating Feature Visualizations with Gradient Slingshots [54.31109240020007]
We introduce a novel method for manipulating Feature Visualization (FV) without significantly impacting the model's decision-making process.
We evaluate the effectiveness of our method on several neural network models and demonstrate its capabilities to hide the functionality of arbitrarily chosen neurons.
arXiv Detail & Related papers (2024-01-11T18:57:17Z) - Human Trajectory Forecasting with Explainable Behavioral Uncertainty [63.62824628085961]
Human trajectory forecasting helps to understand and predict human behaviors, enabling applications from social robots to self-driving cars.
Model-free methods offer superior prediction accuracy but lack explainability, while model-based methods provide explainability but cannot predict well.
We show that BNSP-SFM achieves up to a 50% improvement in prediction accuracy, compared with 11 state-of-the-art methods.
arXiv Detail & Related papers (2023-07-04T16:45:21Z) - INTERACTION: A Generative XAI Framework for Natural Language Inference Explanations [58.062003028768636]
Current XAI approaches only focus on delivering a single explanation.
This paper proposes a generative XAI framework, INTERACTION (explaIn aNd predicT thEn queRy with contextuAl CondiTional varIational autO-eNcoder).
Our novel framework presents explanation in two steps: (step one) Explanation and Label Prediction; and (step two) Diverse Evidence Generation.
arXiv Detail & Related papers (2022-09-02T13:52:39Z) - Deriving Explanation of Deep Visual Saliency Models [6.808418311272862]
We develop a technique to derive explainable saliency models from their corresponding deep neural architecture based saliency models.
We consider two state-of-the-art deep saliency models, namely UNISAL and MSI-Net for our interpretation.
We also build our own deep saliency model named cross-concatenated multi-scale residual block based network (CMRNet) for saliency prediction.
arXiv Detail & Related papers (2021-09-08T12:22:32Z) - Multi-Branch Deep Radial Basis Function Networks for Facial Emotion Recognition [80.35852245488043]
We propose a CNN based architecture enhanced with multiple branches formed by radial basis function (RBF) units.
RBF units capture local patterns shared by similar instances using an intermediate representation.
We show that it is the incorporation of local information that makes the proposed model competitive.
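An RBF unit fires in proportion to how close its input lies to a learned center, which is how such a branch captures local patterns shared by similar instances. A hedged sketch of one branch (the centers, `sigma`, and shapes are illustrative, not the paper's architecture):

```python
import numpy as np

def rbf_branch(x, centers, sigma=1.0):
    """Radial basis function branch: each unit's activation is a
    Gaussian of the distance between the intermediate representation
    x and that unit's center, so responses are local in feature space."""
    d2 = np.sum((x[None, :] - centers) ** 2, axis=1)  # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

An input sitting exactly on a center yields activation 1.0 for that unit, while distant centers contribute almost nothing; this locality is the "local information" the abstract credits for the model's competitiveness.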
arXiv Detail & Related papers (2021-09-07T21:05:56Z) - STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model can achieve comparable performance while utilizing much less trainable parameters and achieve high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z) - Feature Alignment for Approximated Reversibility in Neural Networks [0.0]
We introduce feature alignment, a technique for obtaining approximate reversibility in artificial neural networks.
We show that the technique can be modified for training neural networks locally, saving computational memory resources.
arXiv Detail & Related papers (2021-06-23T17:42:47Z) - THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers [67.8628917474705]
THUNDR is a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people.
We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models.
We observe very solid 3d reconstruction performance for difficult human poses collected in the wild.
arXiv Detail & Related papers (2021-06-17T09:09:24Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Facial Emotion Recognition: State of the Art Performance on FER2013 [0.0]
We achieve the highest single-network classification accuracy on the FER2013 dataset.
Our model achieves state-of-the-art single-network accuracy of 73.28 % on FER2013 without using extra training data.
arXiv Detail & Related papers (2021-05-08T04:20:53Z) - E(n) Equivariant Graph Neural Networks [86.75170631724548]
This paper introduces a new model to learn graph neural networks equivariant to rotations, translations, reflections, and permutations, called E(n)-Equivariant Graph Neural Networks (EGNNs).
In contrast with existing methods, our work does not require computationally expensive higher-order representations in intermediate layers while it still achieves competitive or better performance.
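The reason EGNNs avoid expensive higher-order representations is structural: messages depend only on E(n)-invariants (node features and squared distances), while coordinate updates move along relative vectors, so the whole layer commutes with rotations, translations, and reflections. A simplified sketch, with single linear maps standing in for the paper's MLPs (an assumption for brevity):

```python
import numpy as np

def egnn_layer(h, x, W_e, W_h, w_x):
    """One simplified E(n)-equivariant message-passing step (EGNN-style).

    h: (n, f) node features; x: (n, d) coordinates.
    Messages see only features and squared distances (invariants);
    coordinates are updated along x_i - x_j (equivariant direction).
    """
    n = len(x)
    m = np.zeros((n, W_e.shape[1]))
    dx = np.zeros_like(x)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d2 = np.sum((x[i] - x[j]) ** 2)            # E(n)-invariant
            feat = np.concatenate([h[i], h[j], [d2]])
            m_ij = np.tanh(feat @ W_e)                 # invariant message
            m[i] += m_ij
            dx[i] += (x[i] - x[j]) * (m_ij @ w_x)      # equivariant update
    h_new = np.tanh(np.concatenate([h, m], axis=1) @ W_h)
    x_new = x + dx / (n - 1)
    return h_new, x_new
```

Rotating or translating the input coordinates rotates or translates the output coordinates identically and leaves the features untouched, which can be checked numerically with a random orthogonal matrix.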
arXiv Detail & Related papers (2021-02-19T10:25:33Z) - Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation [22.688772441351308]
Methods based on class activation mapping and randomized input sampling have gained great popularity.
However, the attribution methods provide lower resolution and blurry explanation maps that limit their explanation power.
In this work, we collect visualization maps from multiple layers of the model based on an attribution-based input sampling technique.
We also propose a layer selection strategy that applies to the whole family of CNN-based models.
arXiv Detail & Related papers (2020-10-01T20:27:30Z) - Deep Feature Consistent Variational Autoencoder [46.25741696270528]
We present a novel method for constructing a Variational Autoencoder (VAE).
Instead of using pixel-by-pixel loss, we enforce deep feature consistency between the input and the output of a VAE.
We also show that our method can produce latent vectors that can capture the semantic information of face expressions.
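Deep feature consistency replaces the pixel-wise reconstruction loss with distances between hidden activations of a fixed pretrained network (the paper uses a VGG-style perceptual network). A toy sketch in which `layers`, a list of weight matrices applied with ReLU, stands in for that pretrained network (an assumption for self-containment):

```python
import numpy as np

def feature_consistency_loss(x, x_hat, layers):
    """Deep-feature-consistent loss: compare the input x and its VAE
    reconstruction x_hat layer by layer in the feature space of a fixed
    network, instead of comparing raw pixels."""
    loss, fx, fy = 0.0, x, x_hat
    for W in layers:
        fx = np.maximum(fx @ W, 0.0)          # features of the input
        fy = np.maximum(fy @ W, 0.0)          # features of the reconstruction
        loss += np.mean((fx - fy) ** 2)       # per-layer feature MSE
    return loss
```

Because hidden activations respond to structure rather than individual pixels, matching them tends to preserve perceptual content (e.g. face expressions) better than a pixel-by-pixel loss.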
arXiv Detail & Related papers (2016-10-02T15:48:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.