T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers
- URL: http://arxiv.org/abs/2403.04523v1
- Date: Thu, 7 Mar 2024 14:25:03 GMT
- Title: T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers
- Authors: Mariano V. Ntrougkas, Nikolaos Gkalelis, Vasileios Mezaris
- Abstract summary: The "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, a Transformer-compatible Trainable Attention Mechanism for Explanations.
The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
- Score: 9.284740716447342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development and adoption of Vision Transformers and other deep-learning
architectures for image classification tasks have been rapid. However, the
"black box" nature of neural networks is a barrier to adoption in applications
where explainability is essential. While some techniques for generating
explanations have been proposed, primarily for Convolutional Neural Networks,
adapting such techniques to the new paradigm of Vision Transformers is
non-trivial. This paper presents T-TAME, Transformer-compatible Trainable
Attention Mechanism for Explanations, a general methodology for explaining deep
neural networks used in image classification tasks. The proposed architecture
and training technique can be easily applied to any convolutional or Vision
Transformer-like neural network, using a streamlined training approach. After
training, explanation maps can be computed in a single forward pass; these
explanation maps are comparable to or outperform the outputs of computationally
expensive perturbation-based explainability techniques, achieving SOTA
performance. We apply T-TAME to three popular deep learning classifier
architectures, VGG-16, ResNet-50, and ViT-B-16, trained on the ImageNet
dataset, and we demonstrate improvements over existing state-of-the-art
explainability methods. A detailed analysis of the results and an ablation
study provide insights into how the T-TAME design choices affect the quality of
the generated explanation maps.
Related papers
- Revolutionizing Traffic Sign Recognition: Unveiling the Potential of Vision Transformers [0.0]
Traffic Sign Recognition (TSR) holds a vital role in advancing driver assistance systems and autonomous vehicles.
This study explores three variants of Vision Transformers (PVT, TNT, LNL) and six convolutional neural networks (AlexNet, ResNet, VGG16, MobileNet, EfficientNet, GoogLeNet) as baseline models.
To address the shortcomings of traditional methods, a novel pyramid EATFormer backbone is proposed, amalgamating Evolutionary Algorithms (EAs) with the Transformer architecture.
arXiv Detail & Related papers (2024-04-29T19:18:52Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
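For the "TC" stream's time-frequency representation, a standard recipe is a Continuous Wavelet Transform that turns a 1-D signal into a 2-D scalogram; a small sketch with PyWavelets follows (the sampling rate, wavelet choice, and scale range are placeholders, not the paper's settings).

```python
import numpy as np
import pywt  # PyWavelets

fs = 128                                       # sampling rate in Hz (placeholder)
t = np.linspace(0, 4, 4 * fs, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

scales = np.arange(1, 65)                      # scale range (placeholder)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

scalogram = np.abs(coeffs)                     # 2-D tensor: (scales, time)
print(scalogram.shape)                         # (64, 512) -> image-like input
```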
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z)
- TVE: Learning Meta-attribution for Transferable Vision Explainer [76.68234965262761]
We introduce a Transferable Vision Explainer (TVE) that can effectively explain various vision models in downstream tasks.
TVE is realized through a pre-training process on large-scale datasets towards learning the meta-attribution.
This meta-attribution leverages the versatility of generic backbone encoders to comprehensively encode the attribution knowledge for the input instance, which enables TVE to seamlessly transfer to explain various downstream tasks.
arXiv Detail & Related papers (2023-12-23T21:49:23Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
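The summary leaves the weighting unspecified, but a generic way to realize distance-based weighted attention is to penalize the attention logits by pairwise token distance, as in the hedged sketch below; the paper's actual DWT formulation may differ.

```python
import torch
import torch.nn.functional as F

def distance_weighted_attention(q, k, v, coords, lam=0.1):
    """Attention whose logits are penalized by pairwise token distance (sketch)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    dist = torch.cdist(coords, coords)                 # pairwise distances (B, N, N)
    return F.softmax(logits - lam * dist, dim=-1) @ v  # nearby tokens are favored

B, N, d = 1, 16, 32
q, k, v = (torch.randn(B, N, d) for _ in range(3))
# 4x4 grid of patch-center coordinates
ys, xs = torch.meshgrid(torch.arange(4.), torch.arange(4.), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).unsqueeze(0)  # (1, 16, 2)
print(distance_weighted_attention(q, k, v, coords).shape)  # torch.Size([1, 16, 32])
```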
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning [25.197394237526865]
We propose a modified Transformer-based universal neural network representation learning model NAR-Former V2.
Specifically, we take the network as a graph and design a straightforward tokenizer to encode the network into a sequence.
We incorporate the inductive representation learning capability of GNN into Transformer, enabling Transformer to generalize better when encountering unseen architecture.
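As a toy illustration of "network as a graph, encoded into a sequence", the hypothetical tokenizer below emits one token per layer, pairing an operation id with the node's predecessors; NAR-Former V2's real tokenizer is richer, so treat this purely as a sketch.

```python
# Hypothetical tokenizer: each node (layer) of the architecture graph
# becomes one token encoding its operation type and its in-edges.
OP_VOCAB = {"input": 0, "conv3x3": 1, "conv1x1": 2, "maxpool": 3, "output": 4}

def tokenize_architecture(nodes, edges):
    """nodes: list of op names; edges: list of (src, dst) index pairs."""
    in_edges = {i: [] for i in range(len(nodes))}
    for s, d in edges:
        in_edges[d].append(s)
    # One token per node: (op id, sorted predecessor indices)
    return [(OP_VOCAB[op], tuple(sorted(in_edges[i])))
            for i, op in enumerate(nodes)]

nodes = ["input", "conv3x3", "conv1x1", "maxpool", "output"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(tokenize_architecture(nodes, edges))
# [(0, ()), (1, (0,)), (2, (0,)), (3, (1, 2)), (4, (3,))]
```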
arXiv Detail & Related papers (2023-06-19T09:11:04Z)
- Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
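One simple centering correction, subtracting the uniform mean of the value vectors from the aggregation so that repeated layers mix deviations rather than collapsing to a common average, is sketched below; the paper's exact correction term may differ.

```python
import torch
import torch.nn.functional as F

def centered_self_attention(q, k, v):
    """Self-attention with a centering correction on the aggregation (sketch)."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Subtract the uniform average of the values to counteract oversmoothing.
    return attn @ v - v.mean(dim=-2, keepdim=True)

x = torch.randn(2, 10, 16)
print(centered_self_attention(x, x, x).shape)  # torch.Size([2, 10, 16])
```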
arXiv Detail & Related papers (2023-06-02T15:19:08Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, Transformers that dynamically incorporate contextual information to extract features have been neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
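A hypothetical CT block in this spirit, convolution for local texture followed by multi-head attention for context, might look like the following sketch (channel counts and normalization choices are assumptions, not CFIN's actual design):

```python
import torch
import torch.nn as nn

class CTBlock(nn.Module):
    """Hypothetical conv + Transformer block: local features via convolution,
    global context via self-attention (a sketch in the spirit of CFIN)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU()
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.conv(x)                   # local texture
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C)
        seq = seq + self.attn(self.norm(seq), self.norm(seq),
                              self.norm(seq), need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

trunk = nn.Sequential(CTBlock(32), CTBlock(32))   # cascade of CT blocks
print(trunk(torch.randn(1, 32, 24, 24)).shape)    # torch.Size([1, 32, 24, 24])
```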
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- Explainability-aided Domain Generalization for Image Classification [0.0]
We show that applying methods and architectures from the explainability literature can achieve state-of-the-art performance for the challenging task of domain generalization.
We develop a set of novel algorithms including DivCAM, an approach where the network receives guidance during training via gradient-based class activation maps to focus on a diverse set of discriminative features.
Since these methods offer competitive performance on top of explainability, we argue that the proposed methods can be used as a tool to improve the robustness of deep neural network architectures.
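The gradient-based class activation maps that such guidance builds on can be computed as in this minimal Grad-CAM sketch (this is plain Grad-CAM, not the DivCAM variant itself; the tapped layer is an assumption):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()          # use pretrained weights in practice
feats, grads = [], []
model.layer4.register_forward_hook(lambda m, i, o: feats.append(o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()                      # logit of the top class
score.backward()                               # gradients reach layer4 via hook

w = grads[0].mean(dim=(2, 3), keepdim=True)    # channel importance weights
cam = F.relu((w * feats[0]).sum(dim=1))        # (1, 7, 7) class activation map
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear",
                    align_corners=False)[0]
print(cam.shape)                               # torch.Size([1, 224, 224])
```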
arXiv Detail & Related papers (2021-04-05T02:27:01Z)
- Investigating the Vision Transformer Model for Image Retrieval Tasks [1.375062426766416]
This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior preparation.
The proposed description method utilizes the recently proposed Vision Transformer network while it does not require any training data to adjust parameters.
In image retrieval tasks, global and local descriptors have, in recent years, been successfully replaced by Convolutional Neural Network (CNN)-based methods.
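A minimal version of such a training-free descriptor is the L2-normalized CLS embedding of a pretrained ViT, compared by cosine similarity; the sketch below uses the timm library as a stand-in implementation (an assumption, not the paper's code).

```python
import torch
import torch.nn.functional as F
import timm  # assumed dependency for a pretrained ViT

vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

@torch.no_grad()
def describe(images):                           # images: (B, 3, 224, 224)
    tokens = vit.forward_features(images)       # (B, 197, 768) token sequence
    return F.normalize(tokens[:, 0], dim=-1)    # L2-normalized CLS embedding

gallery = describe(torch.randn(8, 3, 224, 224))
query = describe(torch.randn(1, 3, 224, 224))
ranks = (query @ gallery.T).argsort(dim=-1, descending=True)
print(ranks[0])                                 # gallery indices by similarity
```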
arXiv Detail & Related papers (2021-01-11T08:59:54Z)
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
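A minimal sketch of the idea, a convolution whose output passes through a Gaussian low-pass filter with a sigma that is annealed toward zero over training, follows; the kernel size and annealing schedule are placeholders.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

class SmoothedConv(nn.Module):
    """Conv layer whose output is low-pass filtered; the blur is annealed
    away during training (curriculum-by-smoothing sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.sigma = 2.0                        # start heavily smoothed

    def forward(self, x):
        x = self.conv(x)
        if self.sigma > 0.1:                    # early epochs: low-pass only
            x = TF.gaussian_blur(x, kernel_size=5, sigma=self.sigma)
        return x

layer = SmoothedConv(3, 16)
for epoch in range(5):
    out = layer(torch.randn(2, 3, 32, 32))
    layer.sigma *= 0.5                          # anneal: pass more high frequencies
print(out.shape, round(layer.sigma, 4))         # torch.Size([2, 16, 32, 32]) 0.0625
```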
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.