Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
- URL: http://arxiv.org/abs/2503.06473v3
- Date: Sat, 22 Mar 2025 12:05:30 GMT
- Title: Enhancing Layer Attention Efficiency through Pruning Redundant Retrievals
- Authors: Hanze Li, Xiande Huang
- Abstract summary: We propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. We also introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing evidence suggests that layer attention mechanisms, which enhance interaction among layers in deep neural networks, have significantly advanced network architectures. However, existing layer attention methods suffer from redundancy, as attention weights learned by adjacent layers often become highly similar. This redundancy causes multiple layers to extract nearly identical features, reducing the model's representational capacity and increasing training time. To address this issue, we propose a novel approach to quantify redundancy by leveraging the Kullback-Leibler (KL) divergence between adjacent layers. Additionally, we introduce an Enhanced Beta Quantile Mapping (EBQM) method that accurately identifies and skips redundant layers, thereby maintaining model stability. Our proposed Efficient Layer Attention (ELA) architecture improves both training efficiency and overall performance, achieving a 30% reduction in training time while enhancing performance in tasks such as image classification and object detection.
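A note on the redundancy measure: the abstract does not give the exact formulation, so the following is only a minimal Python sketch of the idea it describes: treat each layer's attention weights as a discrete distribution, compute the KL divergence between adjacent layers, and flag a layer as redundant when the divergence falls below a threshold. The function names and the threshold value are illustrative assumptions, and the EBQM skipping step is omitted because the abstract gives no details about it.

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        """KL(p || q) for two discrete distributions given as 1-D arrays."""
        p = p / p.sum()
        q = q / q.sum()
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def find_redundant_layers(attention_weights, threshold=0.05):
        """Flag layer i as redundant when its attention distribution is
        nearly identical (low KL divergence) to that of layer i-1.
        The threshold is an illustrative value, not taken from the paper."""
        redundant = []
        for i in range(1, len(attention_weights)):
            if kl_divergence(attention_weights[i], attention_weights[i - 1]) < threshold:
                redundant.append(i)
        return redundant

    # Layers 1 and 2 attend almost identically to layer 0's pattern,
    # so both are flagged; layer 3 attends differently and is kept.
    layers = [np.array([0.70, 0.20, 0.10]),
              np.array([0.69, 0.21, 0.10]),
              np.array([0.68, 0.22, 0.10]),
              np.array([0.20, 0.30, 0.50])]
    print(find_redundant_layers(layers))  # -> [1, 2]

Skipping the flagged retrievals is what the abstract credits for the reported training-time reduction; in the paper itself, identifying and skipping redundant layers is handled by EBQM, whose details are not given in the abstract.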
Related papers
- Strengthening Layer Interaction via Dynamic Layer Attention [12.341997220052486]
Existing layer attention methods achieve layer interaction on fixed feature maps in a static manner.
To restore the dynamic context representation capability of the attention mechanism, we propose a Dynamic Layer Attention (DLA) architecture.
Experimental results demonstrate the effectiveness of the proposed DLA architecture, outperforming other state-of-the-art methods in image recognition and object detection tasks.
arXiv Detail & Related papers (2024-06-19T09:35:14Z) - Learning Sparse Neural Networks with Identity Layers [33.11654855515443]
We investigate the intrinsic link between network sparsity and interlayer feature similarity.
We propose a plug-and-play Centered Kernel Alignment (CKA)-based Sparsity Regularization for sparse network training, dubbed CKA-SR.
We find that CKA-SR consistently improves the performance of several state-of-the-art sparse training methods; a minimal sketch of the CKA measure itself appears after this list.
arXiv Detail & Related papers (2023-07-14T14:58:44Z) - Centered Self-Attention Layers [89.21791761168032]
The self-attention mechanism in transformers and the message-passing mechanism in graph neural networks are repeatedly applied.
We show that this application inevitably leads to oversmoothing, i.e., to similar representations at the deeper layers.
We present a correction term to the aggregating operator of these mechanisms.
arXiv Detail & Related papers (2023-06-02T15:19:08Z) - Sharpness-Aware Minimization Leads to Low-Rank Features [49.64754316927016]
Sharpness-aware minimization (SAM) is a recently proposed method that minimizes the sharpness of the training loss of a neural network.
We show that SAM reduces the feature rank, an effect that occurs at different layers of a neural network.
We confirm this effect theoretically and check that it can also occur in deep networks.
arXiv Detail & Related papers (2023-05-25T17:46:53Z) - Dense Network Expansion for Class Incremental Learning [61.00081795200547]
State-of-the-art approaches use a dynamic architecture based on network expansion (NE), in which a task expert is added per task.
A new NE method, dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity.
It outperforms the previous SOTA methods by a margin of 4% in terms of accuracy, with similar or even smaller model scale.
arXiv Detail & Related papers (2023-03-22T16:42:26Z) - Masked Image Modeling with Local Multi-Scale Reconstruction [54.91442074100597]
Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning.
Existing MIM models conduct the reconstruction task only at the top layer of the encoder.
We design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively.
arXiv Detail & Related papers (2023-03-09T13:42:04Z) - A Generic Shared Attention Mechanism for Various Backbone Neural Networks [53.36677373145012]
Self-attention modules (SAMs) produce strongly correlated attention maps across different layers.
Dense-and-Implicit Attention (DIA) shares SAMs across layers and employs a long short-term memory module.
Our simple yet effective DIA can consistently enhance various network backbones.
arXiv Detail & Related papers (2022-10-27T13:24:08Z) - InDistill: Information flow-preserving knowledge distillation for model compression [20.88709060450944]
We introduce InDistill, a method that serves as a warmup stage for enhancing Knowledge Distillation (KD) effectiveness. InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student. The proposed method is extensively evaluated using various pairs of teacher-student architectures on the CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2022-05-20T07:40:09Z) - Image Superresolution using Scale-Recurrent Dense Network [30.75380029218373]
Recent advances in the design of convolutional neural networks (CNNs) have yielded significant improvements in the performance of image super-resolution (SR).
We propose a scale-recurrent SR architecture built upon units containing a series of dense connections within a residual block (Residual Dense Blocks, RDBs).
Our scale-recurrent design delivers competitive performance for higher scale factors while being parametrically more efficient than current state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-28T09:18:43Z) - SIRe-Networks: Skip Connections over Interlaced Multi-Task Learning and Residual Connections for Structure Preserving Object Classification [28.02302915971059]
In this paper, we introduce an interlaced multi-task learning strategy, termed SIRe, to reduce the vanishing gradient problem in relation to the object classification task.
The presented methodology directly improves a convolutional neural network (CNN) by enforcing the input image structure preservation through auto-encoders.
To validate the presented methodology, a simple CNN and various implementations of famous networks are extended via the SIRe strategy and extensively tested on the CIFAR100 dataset.
arXiv Detail & Related papers (2021-10-06T13:54:49Z) - Learning distinct features helps, provably [98.78384185493624]
We study the diversity of the features learned by a two-layer neural network trained with the least squares loss.
We measure the diversity by the average $L_2$-distance between the hidden-layer features.
arXiv Detail & Related papers (2021-06-10T19:14:45Z) - Untangling tradeoffs between recurrence and self-attention in neural networks [81.30894993852813]
We present a formal analysis of how self-attention affects gradient propagation in recurrent networks.
We prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies.
We propose a relevancy screening mechanism that allows for a scalable use of sparse self-attention with recurrence.
arXiv Detail & Related papers (2020-06-16T19:24:25Z)
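For the CKA-SR entry above: the summary names Centered Kernel Alignment (CKA) as the interlayer similarity measure but gives no formula. Below is a minimal sketch of linear CKA and of one way a pairwise similarity penalty could be assembled from it; the choice of the linear (rather than kernel-based) variant and the unweighted pairwise sum are illustrative assumptions, not details confirmed by the summary.

    import numpy as np

    def linear_cka(X, Y):
        """Linear Centered Kernel Alignment between two feature matrices.
        X: (n_samples, d1) activations of one layer; Y: (n_samples, d2) of another.
        Returns a similarity in [0, 1]; values near 1 indicate highly similar features."""
        X = X - X.mean(axis=0, keepdims=True)  # center over the sample dimension
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
        return float(hsic / (np.linalg.norm(X.T @ X, "fro") *
                             np.linalg.norm(Y.T @ Y, "fro") + 1e-12))

    # A CKA-SR-style penalty could sum the similarity over layer pairs so that
    # training discourages near-duplicate features across layers.
    feats = [np.random.randn(128, 64) for _ in range(4)]
    penalty = sum(linear_cka(feats[i], feats[j])
                  for i in range(len(feats)) for j in range(i + 1, len(feats)))
    print(round(penalty, 4))

This is only one way to turn interlayer similarity into a regularizer; the exact weighting of the penalty and its placement in the training loss are not described in the summary above.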