Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity
- URL: http://arxiv.org/abs/2506.15541v1
- Date: Wed, 18 Jun 2025 15:14:56 GMT
- Title: Intrinsic and Extrinsic Organized Attention: Softmax Invariance and Network Sparsity
- Authors: Oluwadamilola Fasina, Ruben V. C. Pohle, Pei-Chun Su, Ronald R. Coifman,
- Abstract summary: We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers.<n>We use an existing methodology for hierarchical organization of tensors to examine network structure.
- Score: 1.837729564584369
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We examine the intrinsic (within the attention head) and extrinsic (amongst the attention heads) structure of the self-attention mechanism in transformers. Theoretical evidence for invariance of the self-attention mechanism to softmax activation is obtained by appealing to paradifferential calculus, (and is supported by computational examples), which relies on the intrinsic organization of the attention heads. Furthermore, we use an existing methodology for hierarchical organization of tensors to examine network structure by constructing hierarchal partition trees with respect to the query, key, and head axes of network 3-tensors. Such an organization is consequential since it allows one to profitably execute common signal processing tasks on a geometry where the organized network 3-tensors exhibit regularity. We exemplify this qualitatively, by visualizing the hierarchical organization of the tree comprised of attention heads and the diffusion map embeddings, and quantitatively by investigating network sparsity with the expansion coefficients of individual attention heads and the entire network with respect to the bi and tri-haar bases (respectively) on the space of queries, keys, and heads of the network. To showcase the utility of our theoretical and methodological findings, we provide computational examples using vision and language transformers. The ramifications of these findings are two-fold: (1) a subsequent step in interpretability analysis is theoretically admitted, and can be exploited empirically for downstream interpretability tasks (2) one can use the network 3-tensor organization for empirical network applications such as model pruning (by virtue of network sparsity) and network architecture comparison.
Related papers
- Geometry of Critical Sets and Existence of Saddle Branches for Two-layer Neural Networks [4.418361486640713]
This paper presents a comprehensive analysis of critical point sets in two-layer neural networks.
Given a critical point, we use these operators to uncover the whole underlying critical set representing the same output function.
We prove existence of saddle branches for any critical set whose output function can be represented by a narrower network.
arXiv Detail & Related papers (2024-05-26T02:32:28Z) - Unsupervised Graph Attention Autoencoder for Attributed Networks using
K-means Loss [0.0]
We introduce a simple, efficient, and clustering-oriented model based on unsupervised textbfGraph Attention textbfAutotextbfEncoder for community detection in attributed networks.
The proposed model adeptly learns representations from both the network's topology and attribute information, simultaneously addressing dual objectives: reconstruction and community discovery.
arXiv Detail & Related papers (2023-11-21T20:45:55Z) - Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks [44.31729147722701]
We show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks.<n>This work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
arXiv Detail & Related papers (2023-05-11T17:19:30Z) - Rank Diminishing in Deep Neural Networks [71.03777954670323]
Rank of neural networks measures information flowing across layers.
It is an instance of a key structural condition that applies across broad domains of machine learning.
For neural networks, however, the intrinsic mechanism that yields low-rank structures remains vague and unclear.
arXiv Detail & Related papers (2022-06-13T12:03:32Z) - Functional Network: A Novel Framework for Interpretability of Deep
Neural Networks [2.641939670320645]
We propose a novel framework for interpretability of deep neural networks, that is, the functional network.
In our experiments, the mechanisms of regularization methods, namely, batch normalization and dropout, are revealed.
arXiv Detail & Related papers (2022-05-24T01:17:36Z) - Learning Structures for Deep Neural Networks [99.8331363309895]
We propose to adopt the efficient coding principle, rooted in information theory and developed in computational neuroscience.
We show that sparse coding can effectively maximize the entropy of the output signals.
Our experiments on a public image classification dataset demonstrate that using the structure learned from scratch by our proposed algorithm, one can achieve a classification accuracy comparable to the best expert-designed structure.
arXiv Detail & Related papers (2021-05-27T12:27:24Z) - A neural anisotropic view of underspecification in deep learning [60.119023683371736]
We show that the way neural networks handle the underspecification of problems is highly dependent on the data representation.
Our results highlight that understanding the architectural inductive bias in deep learning is fundamental to address the fairness, robustness, and generalization of these systems.
arXiv Detail & Related papers (2021-04-29T14:31:09Z) - Joint Learning of Neural Transfer and Architecture Adaptation for Image
Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored for each domain task along with weight finetuning benefits in both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z) - Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective to represent a network into a complete graph for analysis.
By assigning learnable parameters to the edges which reflect the magnitude of connections, the learning process can be performed in a differentiable manner.
This learning process is compatible with existing networks and owns adaptability to larger search spaces and different tasks.
arXiv Detail & Related papers (2020-08-19T04:53:31Z) - Neural networks adapting to datasets: learning network size and topology [77.34726150561087]
We introduce a flexible setup allowing for a neural network to learn both its size and topology during the course of a gradient-based training.
The resulting network has the structure of a graph tailored to the particular learning task and dataset.
arXiv Detail & Related papers (2020-06-22T12:46:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.