Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
- URL: http://arxiv.org/abs/2503.06473v3
- Date: Sat, 22 Mar 2025 12:05:30 GMT
- Title: Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
- Authors: Hanze Li, Xiande Huang
- Abstract summary: We propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. We also introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
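The paper's core redundancy test, comparing adjacent layers' attention weights via KL divergence, can be sketched as follows. This is a minimal illustration, assuming each layer's attention weights form a softmax-normalized distribution; the threshold value and the skipping policy are placeholders and do not reproduce the paper's EBQM procedure.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def redundant_layers(attention_dists, threshold=0.05):
    """Return indices of layers whose attention distribution is nearly
    identical (KL divergence below threshold) to the previous layer's."""
    skipped = []
    for i in range(1, len(attention_dists)):
        if kl_divergence(attention_dists[i - 1], attention_dists[i]) < threshold:
            skipped.append(i)
    return skipped

# Layers 0 and 1 attend almost identically; layer 2 differs sharply.
dists = [
    [0.70, 0.20, 0.10],
    [0.69, 0.21, 0.10],
    [0.10, 0.30, 0.60],
]
print(redundant_layers(dists))  # → [1]
```

Here only layer 1 is flagged, since its attention distribution is nearly indistinguishable from layer 0's, while layer 2 diverges strongly from layer 1.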
Related papers
- Towards the Training of Deeper Predictive Coding Neural Networks [53.15874572081944]
Predictive coding networks trained with equilibrium propagation are neural models that perform inference through an iterative energy process. Previous studies have demonstrated their effectiveness in shallow architectures, but show significant performance degradation when depth exceeds five to seven layers. We show that this degradation arises from exponentially imbalanced errors between layers during weight updates, and from predictions of the previous layer failing to effectively guide updates in deeper layers.
arXiv Detail & Related papers (2025-06-30T12:44:47Z) - Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning [57.3885832382455]
We show that introducing static network sparsity alone can unlock further scaling potential beyond dense counterparts with state-of-the-art architectures. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve higher parameter efficiency and greater network expressivity.
arXiv Detail & Related papers (2025-06-20T17:54:24Z) - Auto-Compressing Networks [59.83547898874152]
We introduce Auto-Compressing Networks (ACNs), an architectural variant in which additive long feedforward connections from each layer replace traditional short residual connections. ACNs showcase a unique property we coin "auto-compression": the ability of a network to organically compress information during training. We find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, and mitigation of catastrophic forgetting.
arXiv Detail & Related papers (2025-06-11T13:26:09Z) - Strengthening Layer Interaction via Dynamic Layer Attention [12.341997220052486]
Existing layer attention methods achieve layer interaction on fixed feature maps in a static manner.
To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture.
Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks.
arXiv Detail & Related papers (2024-06-19T09:35:14Z) - Learning Sparse Neural Networks with Identity Layers [33.11654855515443]
We investigate the intrinsic link between network sparsity and interlayer feature similarity.
We propose a plug-and-play CKA-based Sparsity Regularization for sparse network training, dubbed CKA-SR.
We find that CKA-SR consistently improves the performance of several State-Of-The-Art sparse training methods.
arXiv Detail & Related papers (2023-07-14T14:58:44Z) - Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
arXiv Detail & Related papers (2023-06-02T15:19:08Z) - Sharpness-Aware Minimization Leads to Low-Rank Features [49.64754316927016]
Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the worst-case training loss within a neighborhood of a neural network's weights.
We show that SAM reduces the feature rank, an effect that occurs at different layers of a neural network.
We confirm this effect theoretically and check that it can also occur in deep networks.
arXiv Detail & Related papers (2023-05-25T17:46:53Z) - Dense Network Expansion for Class Incremental Learning [61.00081795200547]
State-of-the-art approaches use a dynamic architecture based on network expansion (NE), in which a task expert is added per task.
A new NE method, dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity.
It outperforms the previous SOTA methods by a margin of 4% in terms of accuracy, with similar or even smaller model scale.
arXiv Detail & Related papers (2023-03-22T16:42:26Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct reconstruction task only at the top layer of encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - InDistill: Information flow-preserving knowledge distillation for model compression [20.88709060450944]
We introduce InDistill, a method that serves as a warmup stage to improve the effectiveness of knowledge distillation (KD). InDistill focuses on transferring critical information-flow paths from a heavyweight teacher to a lightweight student. The proposed method is extensively evaluated using various teacher-student architecture pairs on the CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2022-05-20T07:40:09Z) - Image Superresolution using Scale-Recurrent Dense Network [30.75380029218373]
Recent advances in the design of convolutional neural networks (CNNs) have yielded significant improvements in the performance of image super-resolution (SR).
We propose a scale-recurrent SR architecture built upon units containing a series of dense connections within a residual block (Residual Dense Blocks (RDBs)).
Our scale recurrent design delivers competitive performance for higher scale factors while being parametrically more efficient as compared to current state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-28T09:18:43Z) - SIRe-Networks: Skip Connections over Interlaced Multi-Task Learning and Residual Connections for Structure Preserving Object Classification [28.02302915971059]
In this paper, we introduce an interlaced multi-task learning strategy, termed SIRe, to reduce vanishing gradients in the object classification task.
The presented methodology directly improves a convolutional neural network (CNN) by enforcing the input image structure preservation through auto-encoders.
To validate the presented methodology, a simple CNN and various implementations of famous networks are extended via the SIRe strategy and extensively tested on the CIFAR100 dataset.
arXiv Detail & Related papers (2021-10-06T13:54:49Z) - Learning distinct features helps, provably [98.78384185493624]
We study the diversity of the features learned by a two-layer neural network trained with the least squares loss.
We measure the diversity by the average $L_2$-distance between the hidden-layer features.
arXiv Detail & Related papers (2021-06-10T19:14:45Z) - Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.