Related papers: Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi

URL: http://arxiv.org/abs/2501.12900v1
Date: Wed, 22 Jan 2025 14:19:48 GMT
Title: Unified CNNs and transformers underlying learning mechanism reveals multi-head attention modus vivendi
Authors: Ella Koresh, Ronit D. Gross, Yuval Meir, Yarden Tzach, Tal Halevi, Ido Kanter
Abstract summary: Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers. Vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Convolutional neural networks (CNNs) evaluate short-range correlations in input images which progress along the layers, whereas vision transformer (ViT) architectures evaluate long-range correlations, using repeated transformer encoders composed of fully connected layers. Both are designed to solve complex classification tasks but from different perspectives. This study demonstrates that CNNs and ViT architectures stem from a unified underlying learning mechanism, which quantitatively measures the single-nodal performance (SNP) of each node in feedforward (FF) and multi-head attention (MHA) subblocks. Each node identifies small clusters of possible output labels, with additional noise represented as labels outside these clusters. These features are progressively sharpened along the transformer encoders, enhancing the signal-to-noise ratio. This unified underlying learning mechanism leads to two main findings. First, it enables an efficient applied nodal diagonal connection (ANDC) pruning technique without affecting the accuracy. Second, based on the SNP, spontaneous symmetry breaking occurs among the MHA heads, such that each head focuses its attention on a subset of labels through cooperation among its SNPs. Consequently, each head becomes an expert in recognizing its designated labels, representing a quantitative MHA modus vivendi mechanism. These results are based on a compact convolutional transformer architecture trained on the CIFAR-100 and Flowers-102 datasets and call for their extension to other architectures and applications, such as natural language processing.

Related papers

Low-latency vision transformers via large-scale multi-head attention [0.0]
A learning mechanism is generalized to large-scale MHA (LS-MHA) using a single matrix value representing single-head performance. Several distinct vision transformer (ViT) architectures achieve the same accuracy but differ in their LS-MHA structures. The extension of this learning mechanism to natural language processing tasks has the potential to yield new insights in deep learning.
arXiv Detail & Related papers (2025-06-30T13:23:46Z)
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
We study how a two-attention-layer transformer is trained to perform ICL on $n$-gram Markov chain data. We prove that the gradient flow with respect to a cross-entropy ICL loss converges to a limiting model.
arXiv Detail & Related papers (2024-09-09T18:10:26Z)
You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment [45.62136459502005]
We propose a network to perform full reference (FR) and no reference (NR) IQA. We first employ an encoder to extract multi-level features from input images. A Hierarchical Attention (HA) module is proposed as a universal adapter for both FR and NR inputs. A Semantic Distortion Aware (SDA) module is proposed to examine feature correlations between shallow and deep layers of the encoder.
arXiv Detail & Related papers (2023-10-14T11:03:04Z)
Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community. Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs). We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z)
A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers. Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module. Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z)
Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students [18.860732413631887]
We propose a novel Semi-supervised Learning (SSL) approach that consists of two students, one based on the vision transformer (ViT) and the other based on the convolutional neural network (CNN). Our method subtly incorporates the multi-level consistency regularization on the predictions and the heterogeneous feature spaces via pseudo-labeling for the unlabeled data. We validate the TCC framework on Cityscapes and Pascal VOC 2012 datasets, where it outperforms existing SSL methods by a large margin.
arXiv Detail & Related papers (2022-09-06T02:11:08Z)
Two-Stream Graph Convolutional Network for Intra-oral Scanner Image Segmentation [133.02190910009384]
We propose a two-stream graph convolutional network (i.e., TSGCN) to handle inter-view confusion between different raw attributes. Our TSGCN significantly outperforms state-of-the-art methods in 3D tooth (surface) segmentation.
arXiv Detail & Related papers (2022-04-19T10:41:09Z)
Hybrid Routing Transformer for Zero-Shot Learning [83.64532548391]
This paper presents a novel transformer encoder-decoder model, called hybrid routing transformer (HRT). We embed an active attention, which is constructed by both the bottom-up and the top-down dynamic routing pathways, to generate the attribute-aligned visual feature. In the HRT decoder, we use static routing to calculate the correlation among the attribute-aligned visual features, the corresponding attribute semantics, and the class attribute vectors to generate the final class label predictions.
arXiv Detail & Related papers (2022-03-29T07:55:08Z)
CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the advantages of leveraging detailed spatial information from CNN and the global context provided by transformer for enhanced representation learning. The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation. The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)
Self-grouping Convolutional Neural Networks [30.732298624941738]
We propose a novel method of designing self-grouping convolutional neural networks, called SG-CNN. For each filter, we first evaluate the importance value of their input channels to identify the importance vectors. Using the resulting data-dependent centroids, we prune the less important connections, which implicitly minimizes the accuracy loss of the pruning.
arXiv Detail & Related papers (2020-09-29T06:24:32Z)
Kernelized dense layers for facial expression recognition [10.98068123467568]
We propose a Kernelized Dense Layer (KDL) which captures higher order feature interactions instead of conventional linear relations. We show that our model achieves competitive results with respect to the state-of-the-art approaches.
arXiv Detail & Related papers (2020-09-22T21:02:00Z)
Dual-constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior [80.5637175255349]
We propose a new enriched prior based Dual-constrained Deep Semi-Supervised Coupled Factorization Network, called DS2CF-Net. To extract hidden deep features, DS2CF-Net is modeled as a deep-structure and geometrical structure-constrained neural network. Our network can obtain state-of-the-art performance for representation learning and clustering.
arXiv Detail & Related papers (2020-09-08T13:10:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.