Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi
- URL: http://arxiv.org/abs/2501.12900v1
- Date: Wed, 22 Jan 2025 14:19:48 GMT
- Title: Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi
- Authors: Ella Koresh, Ronit D. Gross, Yuval Meir, Yarden Tzach, Tal Halevi, Ido Kanter
- Abstract summary: Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers.
Vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers.
This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism.
- Score: 0.0
- Abstract: Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) subblocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism leads to two main findings. First, it enables an efficient applied nodal diagonal connection (ANDC) pruning technique without affecting the accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, such that each head focuses its attention on a subset of labels through cooperation among its SNPs. Consequently, each head becomes an expert in recognizing its designated labels, representing a quantitative MHA modus vivendi mechanism. These results are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets and call for their extension to other architectures and applications, such as natural language processing.
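The idea of a node identifying a small cluster of labels, with the remaining labels acting as noise, can be illustrated with a toy sketch. This is not the authors' procedure: the threshold criterion, the ratio used as signal-to-noise, and the names `node_cluster`, `signal_to_noise`, and `head_labels` are assumptions made purely for illustration, and the input is a hypothetical per-label mean response of a single node.

```python
from typing import List, Set


def node_cluster(label_response: List[float], threshold: float) -> Set[int]:
    """Labels whose mean response exceeds `threshold`.

    The thresholding rule is a hypothetical stand-in for however a
    node's label cluster is actually identified.
    """
    return {label for label, r in enumerate(label_response) if r > threshold}


def signal_to_noise(label_response: List[float], cluster: Set[int]) -> float:
    """Mean response inside the cluster divided by mean response outside it.

    This ratio is an assumed, simplified notion of signal-to-noise.
    """
    inside = [r for label, r in enumerate(label_response) if label in cluster]
    outside = [r for label, r in enumerate(label_response) if label not in cluster]
    if not inside or not outside:
        raise ValueError("cluster must be a non-empty proper subset of the labels")
    signal = sum(inside) / len(inside)
    noise = sum(outside) / len(outside)
    return signal / noise if noise > 0 else float("inf")


def head_labels(node_responses: List[List[float]], threshold: float) -> Set[int]:
    """Union of the label clusters of all nodes belonging to one head."""
    labels: Set[int] = set()
    for response in node_responses:
        labels |= node_cluster(response, threshold)
    return labels


if __name__ == "__main__":
    # Toy data: two heads, two nodes each, six output labels.
    head_a = [[0.9, 0.8, 0.1, 0.1, 0.1, 0.1], [0.7, 0.9, 0.2, 0.1, 0.1, 0.1]]
    head_b = [[0.1, 0.1, 0.1, 0.1, 0.8, 0.9], [0.1, 0.2, 0.1, 0.1, 0.9, 0.7]]
    for name, head in (("A", head_a), ("B", head_b)):
        print(name, sorted(head_labels(head, threshold=0.5)))
        for response in head:
            cluster = node_cluster(response, threshold=0.5)
            print("  SNR:", signal_to_noise(response, cluster))
```

In this toy setting the two heads end up covering disjoint label subsets, which mirrors the head specialization described in the abstract only qualitatively.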
Related papers
- Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data.
We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
- Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z)
- Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students [18.860732413631887]
We propose a novel Semi-supervised Learning (SSL) approach that consists of two students, one based on the vision transformer (ViT) and the other based on the convolutional neural network (CNN).
Our method subtly incorporates the multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo-labeling for the unlabeled data.
We validate the TCC framework on the Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing SSL methods by a large margin.
arXiv Detail & Related papers (2022-09-06T02:11:08Z)
- Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes.
Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z)
- Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called hybrid routing transformer (HRT).
We embed an active attention, constructed from both the bottom-up and the top-down dynamic routing pathways, to generate the attribute-aligned visual feature.
In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that combines the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operations.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)
- Self-grouping Convolutional Neural Networks [30.732298624941738]
We propose a novel method of designing self-grouping convolutional neural networks, called SG-CNN.
For each filter, we first evaluate the importance value of its input channels to identify the importance vectors.
Using the resulting data-dependent centroids, we prune the less important connections, which implicitly minimizes the accuracy loss of the pruning.
arXiv Detail & Related papers (2020-09-29T06:24:32Z)
- Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior [80.5637175255349]
We propose a new enriched-prior-based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net.
To extract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network.
Our network can obtain state-of-the-art performance for representation learning and clustering.
arXiv Detail & Related papers (2020-09-08T13:10:21Z)