Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- URL: http://arxiv.org/abs/2212.06872v5
- Date: Mon, 24 Jun 2024 04:47:38 GMT
- Title: Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods
- Authors: Mingqi Jiang, Saeed Khorram, Li Fuxin
- Abstract summary: We study the decision-making of different visual recognition backbones by applying deep explanation algorithms on a dataset-wide basis.
We find that Transformers and ConvNeXt are more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions.
We plot a landscape of different models based on their feature-use similarity.
- Score: 4.661764541283174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To gain insight into the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal differences among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpointed the choice of normalization as especially important for the compositionality of a model: batch normalization leads to less compositionality, while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
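The abstract singles out normalization: batch normalization is associated with less compositionality, group and layer normalization with more. Mechanically, the difference between them is which axes the normalization statistics are pooled over. Below is a minimal NumPy sketch of only that difference, assuming NCHW feature maps, training-mode batch statistics, and no learned scale or shift. The helpers `batch_norm` and `group_norm` are hypothetical illustrations, not code from the paper, and the sketch neither reproduces the paper's experiments nor explains why normalization affects compositionality.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize an (N, C, H, W) array per channel.

    Mean and variance are pooled over the batch and spatial axes, so each
    sample's output depends on the other samples in the batch.
    Learned scale/shift and running averages are omitted (assumption).
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def group_norm(x, num_groups, eps=1e-5):
    """Normalize an (N, C, H, W) array per sample and per channel group.

    Statistics never cross the batch axis. With num_groups=1 this equals
    layer normalization over (C, H, W) for each sample.
    """
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be divisible by num_groups")
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

# Perturb every sample except the first, then compare the first sample's output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))
x_other = x.copy()
x_other[1:] += 10.0

print(np.allclose(batch_norm(x)[0], batch_norm(x_other)[0]))        # False: depends on the batch
print(np.allclose(group_norm(x, 4)[0], group_norm(x_other, 4)[0]))  # True: per-sample statistics
```

Transformer blocks typically apply LayerNorm to each token's feature vector, and ConvNeXt applies LayerNorm over channels at each spatial position; in all of these variants the statistics are computed per sample rather than across the batch, which is the property this sketch isolates.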
Related papers
- Normalization in Proportional Feature Spaces [49.48516314472825]
Normalization plays a central role in data representation, characterization, visualization, analysis, comparison, classification, and modeling.
The selection of an appropriate normalization method needs to take into account the type and characteristics of the involved features.
Date: 2024-09-17T17:46:27Z
- Revealing Multimodal Contrastive Representation Learning through Latent Partial Causal Models [85.67870425656368]
We introduce a unified causal model specifically designed for multimodal data.
We show that multimodal contrastive representation learning excels at identifying latent coupled variables.
Experiments demonstrate the robustness of our findings, even when the assumptions are violated.
Date: 2024-02-09T07:18:06Z
- Bayesian Unsupervised Disentanglement of Anatomy and Geometry for Deep Groupwise Image Registration [50.62725807357586]
This article presents a general Bayesian learning framework for multi-modal groupwise image registration.
We propose a novel hierarchical variational auto-encoding architecture to realise the inference procedure of the latent variables.
Experiments were conducted to validate the proposed framework, including four different datasets from cardiac, brain, and abdominal medical images.
Date: 2024-01-04T08:46:39Z
- From Bricks to Bridges: Product of Invariances to Enhance Latent Space Communication [19.336940758147442]
It has been observed that representations learned by distinct neural networks conceal structural similarities when the models are trained under similar inductive biases.
We introduce a versatile method to directly incorporate a set of invariances into the representations, constructing a product space of invariant components on top of the latent representations.
We validate our solution on classification and reconstruction tasks, observing consistent latent similarity and downstream performance improvements in a zero-shot stitching setting.
Date: 2023-10-02T13:55:38Z
- SO(2) and O(2) Equivariance in Image Recognition with Bessel-Convolutional Neural Networks [63.24965775030674]
This work presents the development of Bessel-convolutional neural networks (B-CNNs).
B-CNNs exploit a particular decomposition based on Bessel functions to modify the key operation between images and filters.
A study is carried out to assess the performance of B-CNNs compared with other methods.
Date: 2023-04-18T18:06:35Z
- Identifiability Results for Multimodal Contrastive Learning [72.15237484019174]
We show that it is possible to recover shared factors in a more general setup than the multi-view setting studied previously.
Our work provides a theoretical basis for multimodal representation learning and explains in which settings multimodal contrastive learning can be effective in practice.
Date: 2023-03-16T09:14:26Z
- Interpretable Diversity Analysis: Visualizing Feature Representations In Low-Cost Ensembles [0.0]
This paper introduces several interpretability methods that can be used to qualitatively analyze diversity.
We demonstrate these techniques by comparing the diversity of feature representations between child networks using two low-cost ensemble algorithms.
Date: 2023-02-12T00:32:03Z
- Unsupervised Multimodal Change Detection Based on Structural Relationship Graph Representation Learning [40.631724905575034]
Unsupervised multimodal change detection is a practical and challenging topic that can play an important role in time-sensitive emergency applications.
We take advantage of two types of modality-independent structural relationships in multimodal images.
We present a structural relationship graph representation learning framework for measuring the similarity of the two structural relationships.
Date: 2022-10-03T13:55:08Z
- Fidelity of Ensemble Aggregation for Saliency Map Explanations using Bayesian Optimization Techniques [0.0]
We present and compare different pixel-based aggregation schemes with the goal of generating a new explanation.
We incorporate the variance between the individual explanations into the aggregation process.
We also analyze the effect of multiple normalization techniques on ensemble aggregation.
Date: 2022-07-04T16:34:12Z
- IMACS: Image Model Attribution Comparison Summaries [16.80986701058596]
We introduce IMACS, a method that combines gradient-based model attributions with aggregation and visualization techniques.
IMACS extracts salient input features from an evaluation dataset, clusters them based on similarity, then visualizes differences in model attributions for similar input features.
We show how our technique can uncover behavioral differences caused by domain shift between two models trained on satellite images.
Date: 2022-01-26T21:35:14Z
- Semantic Change Detection with Asymmetric Siamese Networks [71.28665116793138]
Given two aerial images, semantic change detection aims to locate the land-cover variations and identify their change types with pixel-wise boundaries.
This problem is vital in many earth vision related tasks, such as precise urban planning and natural resource management.
We present an asymmetric siamese network (ASN) to locate and identify semantic changes through feature pairs obtained from modules of widely different structures.
Date: 2020-10-12T13:26:30Z
This list is automatically generated from the titles and abstracts of the papers on this site.