T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers
- URL: http://arxiv.org/abs/2403.04523v1
- Date: Thu, 7 Mar 2024 14:25:03 GMT
- Title: T-TAME: Trainable Attention Mechanism for Explaining Convolutional
Networks and Vision Transformers
- Authors: Mariano V. Ntrougkas, Nikolaos Gkalelis, Vasileios Mezaris
- Abstract summary: The "black box" nature of neural networks is a barrier to adoption in applications where explainability is essential.
This paper presents T-TAME, a Transformer-compatible Trainable Attention Mechanism for Explanations.
The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network.
- Score: 9.284740716447342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The development and adoption of Vision Transformers and other deep-learning
architectures for image classification tasks have been rapid. However, the
"black box" nature of neural networks is a barrier to adoption in applications
where explainability is essential. While some techniques for generating
explanations have been proposed, primarily for Convolutional Neural Networks,
adapting such techniques to the new paradigm of Vision Transformers is
non-trivial. This paper presents T-TAME, Transformer-compatible Trainable
Attention Mechanism for Explanations, a general methodology for explaining deep
neural networks used in image classification tasks. The proposed architecture
and training technique can be easily applied to any convolutional or Vision
Transformer-like neural network, using a streamlined training approach. After
training, explanation maps can be computed in a single forward pass; these
explanation maps are comparable to or outperform the outputs of computationally
expensive perturbation-based explainability techniques, achieving SOTA
performance. We apply T-TAME to three popular deep learning classifier
architectures, VGG-16, ResNet-50, and ViT-B-16, trained on the ImageNet
dataset, and we demonstrate improvements over existing state-of-the-art
explainability methods. A detailed analysis of the results and an ablation
study provide insights into how the T-TAME design choices affect the quality of
the generated explanation maps.
Related papers
- Revolutionizing Traffic Sign Recognition: Unveiling the Potential of Vision Transformers [0.0]
Traffic Sign Recognition (TSR) holds a vital role in advancing driver assistance systems and autonomous vehicles.
This study explores three variants of Vision Transformers (PVT, TNT, LNL) and six convolutional neural networks (AlexNet, ResNet, VGG16, MobileNet, EfficientNet, GoogLeNet) as baseline models.
To address the shortcomings of traditional methods, a novel pyramid EATFormer backbone is proposed, amalgamating Evolutionary Algorithms (EAs) with the Transformer architecture.
arXiv Detail & Related papers (2024-04-29T19:18:52Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
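For the "TC" stream's time-frequency representation, a standard recipe is a Continuous Wavelet Transform that turns a 1-D signal into a 2-D scalogram; a small sketch with PyWavelets follows (the sampling rate, wavelet choice, and scale range are placeholders, not the paper's settings).

```python
import numpy as np
import pywt  # PyWavelets

fs = 128                                       # sampling rate in Hz (placeholder)
t = np.linspace(0, 4, 4 * fs, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

scales = np.arange(1, 65)                      # scale range (placeholder)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

scalogram = np.abs(coeffs)                     # 2-D tensor: (scales, time)
print(scalogram.shape)                         # (64, 512) -> image-like input
```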
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z)
- TVE: Learning Meta-attribution for Transferable Vision Explainer [76.68234965262761]
We introduce a Transferable Vision Explainer (TVE) that can effectively explain various vision models in downstream tasks.
TVE is realized through a pre-training process on large-scale datasets towards learning the meta-attribution.
This meta-attribution leverages the versatility of generic backbone encoders to comprehensively encode the attribution knowledge for the input instance, which enables TVE to seamlessly transfer to explain various downstream tasks.
arXiv Detail & Related papers (2023-12-23T21:49:23Z)
- Distance Weighted Trans Network for Image Completion [52.318730994423106]
We propose a new architecture that relies on Distance-based Weighted Transformer (DWT) to better understand the relationships between an image's components.
CNNs are used to augment the local texture information of coarse priors.
DWT blocks are used to recover certain coarse textures and coherent visual structures.
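The summary leaves the weighting unspecified, but a generic way to realize distance-based weighted attention is to penalize the attention logits by pairwise token distance, as in the hedged sketch below; the paper's actual DWT formulation may differ.

```python
import torch
import torch.nn.functional as F

def distance_weighted_attention(q, k, v, coords, lam=0.1):
    """Attention whose logits are penalized by pairwise token distance (sketch)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5        # (B, N, N)
    dist = torch.cdist(coords, coords)                 # pairwise distances (B, N, N)
    return F.softmax(logits - lam * dist, dim=-1) @ v  # nearby tokens are favored

B, N, d = 1, 16, 32
q, k, v = (torch.randn(B, N, d) for _ in range(3))
# 4x4 grid of patch-center coordinates
ys, xs = torch.meshgrid(torch.arange(4.), torch.arange(4.), indexing="ij")
coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).unsqueeze(0)  # (1, 16, 2)
print(distance_weighted_attention(q, k, v, coords).shape)  # torch.Size([1, 16, 32])
```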
arXiv Detail & Related papers (2023-10-11T12:46:11Z)
- NAR-Former V2: Rethinking Transformer for Universal Neural Network Representation Learning [25.197394237526865]
We propose a modified Transformer-based universal neural network representation learning model NAR-Former V2.
Specifically, we take the network as a graph and design a straightforward tokenizer to encode the network into a sequence.
We incorporate the inductive representation learning capability of GNN into Transformer, enabling Transformer to generalize better when encountering unseen architecture.
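As a toy illustration of "network as a graph, encoded into a sequence", the hypothetical tokenizer below emits one token per layer, pairing an operation id with the node's predecessors; NAR-Former V2's real tokenizer is richer, so treat this purely as a sketch.

```python
# Hypothetical tokenizer: each node (layer) of the architecture graph
# becomes one token encoding its operation type and its in-edges.
OP_VOCAB = {"input": 0, "conv3x3": 1, "conv1x1": 2, "maxpool": 3, "output": 4}

def tokenize_architecture(nodes, edges):
    """nodes: list of op names; edges: list of (src, dst) index pairs."""
    in_edges = {i: [] for i in range(len(nodes))}
    for s, d in edges:
        in_edges[d].append(s)
    # One token per node: (op id, sorted predecessor indices)
    return [(OP_VOCAB[op], tuple(sorted(in_edges[i])))
            for i, op in enumerate(nodes)]

nodes = ["input", "conv3x3", "conv1x1", "maxpool", "output"]
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(tokenize_architecture(nodes, edges))
# [(0, ()), (1, (0,)), (2, (0,)), (3, (1, 2)), (4, (3,))]
```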
arXiv Detail & Related papers (2023-06-19T09:11:04Z)
- Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
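One simple centering correction, subtracting the uniform mean of the value vectors from the aggregation so that repeated layers mix deviations rather than collapsing to a common average, is sketched below; the paper's exact correction term may differ.

```python
import torch
import torch.nn.functional as F

def centered_self_attention(q, k, v):
    """Self-attention with a centering correction on the aggregation (sketch)."""
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    # Subtract the uniform average of the values to counteract oversmoothing.
    return attn @ v - v.mean(dim=-2, keepdim=True)

x = torch.randn(2, 10, 16)
print(centered_self_attention(x, x, x).shape)  # torch.Size([2, 10, 16])
```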
arXiv Detail & Related papers (2023-06-02T15:19:08Z)
- Cross-receptive Focused Inference Network for Lightweight Image Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, Transformers that dynamically incorporate contextual information to extract features have been neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
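A hypothetical CT block in this spirit, convolution for local texture followed by multi-head attention for context, might look like the following sketch (channel counts and normalization choices are assumptions, not CFIN's actual design):

```python
import torch
import torch.nn as nn

class CTBlock(nn.Module):
    """Hypothetical conv + Transformer block: local features via convolution,
    global context via self-attention (a sketch in the spirit of CFIN)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU()
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        x = x + self.conv(x)                   # local texture
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C)
        seq = seq + self.attn(self.norm(seq), self.norm(seq),
                              self.norm(seq), need_weights=False)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

trunk = nn.Sequential(CTBlock(32), CTBlock(32))   # cascade of CT blocks
print(trunk(torch.randn(1, 32, 24, 24)).shape)    # torch.Size([1, 32, 24, 24])
```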
arXiv Detail & Related papers (2022-07-06T16:32:29Z)
- Explainability-aided Domain Generalization for Image Classification [0.0]
We show that applying methods and architectures from the explainability literature can achieve state-of-the-art performance for the challenging task of domain generalization.
We develop a set of novel algorithms including DivCAM, an approach where the network receives guidance during training via gradient-based class activation maps to focus on a diverse set of discriminative features.
Since these methods offer competitive performance on top of explainability, we argue that the proposed methods can be used as a tool to improve the robustness of deep neural network architectures.
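The gradient-based class activation maps that such guidance builds on can be computed as in this minimal Grad-CAM sketch (this is plain Grad-CAM, not the DivCAM variant itself; the tapped layer is an assumption):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()          # use pretrained weights in practice
feats, grads = [], []
model.layer4.register_forward_hook(lambda m, i, o: feats.append(o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

x = torch.randn(1, 3, 224, 224)
score = model(x)[0].max()                      # logit of the top class
score.backward()                               # gradients reach layer4 via hook

w = grads[0].mean(dim=(2, 3), keepdim=True)    # channel importance weights
cam = F.relu((w * feats[0]).sum(dim=1))        # (1, 7, 7) class activation map
cam = F.interpolate(cam[None], size=(224, 224), mode="bilinear",
                    align_corners=False)[0]
print(cam.shape)                               # torch.Size([1, 224, 224])
```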
arXiv Detail & Related papers (2021-04-05T02:27:01Z)
- Investigating the Vision Transformer Model for Image Retrieval Tasks [1.375062426766416]
This paper introduces a plug-and-play descriptor that can be effectively adopted for image retrieval tasks without prior preparation.
The proposed description method utilizes the recently proposed Vision Transformer network while it does not require any training data to adjust parameters.
In image retrieval tasks, global and local descriptors have, in recent years, been successfully replaced by Convolutional Neural Network (CNN)-based methods.
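A minimal version of such a training-free descriptor is the L2-normalized CLS embedding of a pretrained ViT, compared by cosine similarity; the sketch below uses the timm library as a stand-in implementation (an assumption, not the paper's code).

```python
import torch
import torch.nn.functional as F
import timm  # assumed dependency for a pretrained ViT

vit = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

@torch.no_grad()
def describe(images):                           # images: (B, 3, 224, 224)
    tokens = vit.forward_features(images)       # (B, 197, 768) token sequence
    return F.normalize(tokens[:, 0], dim=-1)    # L2-normalized CLS embedding

gallery = describe(torch.randn(8, 3, 224, 224))
query = describe(torch.randn(1, 3, 224, 224))
ranks = (query @ gallery.T).argsort(dim=-1, descending=True)
print(ranks[0])                                 # gallery indices by similarity
```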
arXiv Detail & Related papers (2021-01-11T08:59:54Z)
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
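A minimal sketch of the idea, a convolution whose output passes through a Gaussian low-pass filter with a sigma that is annealed toward zero over training, follows; the kernel size and annealing schedule are placeholders.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

class SmoothedConv(nn.Module):
    """Conv layer whose output is low-pass filtered; the blur is annealed
    away during training (curriculum-by-smoothing sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.sigma = 2.0                        # start heavily smoothed

    def forward(self, x):
        x = self.conv(x)
        if self.sigma > 0.1:                    # early epochs: low-pass only
            x = TF.gaussian_blur(x, kernel_size=5, sigma=self.sigma)
        return x

layer = SmoothedConv(3, 16)
for epoch in range(5):
    out = layer(torch.randn(2, 3, 32, 32))
    layer.sigma *= 0.5                          # anneal: pass more high frequencies
print(out.shape, round(layer.sigma, 4))         # torch.Size([2, 16, 32, 32]) 0.0625
```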
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.