Value Residual Learning For Alleviating Attention Concentration In Transformers
- URL: http://arxiv.org/abs/2410.17897v2
- Date: Thu, 14 Nov 2024 17:46:04 GMT
- Title: Value Residual Learning For Alleviating Attention Concentration In Transformers
- Authors: Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan
- Abstract summary: Stacking multiple attention layers leads to attention concentration.
One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers.
We propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers.
- Score: 14.898656879574622
- License:
- Abstract: Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose the Transformer with residual value (ResFormer), which approximates cross-layer attention by adding a residual connection from the values of the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from the first layer, reducing the KV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates the attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as on downstream tasks. Further visualization results suggest that ResFormer alleviates attention sinks by avoiding value-state drains. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods such as GQA and CLA, with performance influenced by sequence length and cumulative learning rate.
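To make the value-residual idea concrete, here is a minimal single-head sketch of how each layer could mix its own value states with the first layer's values before attention, assuming PyTorch; the class names, the fixed mixing weight `lam`, and the overall stack structure are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of value residual learning (ResFormer-style), assuming PyTorch.
# All names and the fixed mixing weight `lam` are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueResidualAttention(nn.Module):
    """Single-head self-attention whose value states mix in the first layer's values."""
    def __init__(self, d_model: int, lam: float = 0.5):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.lam = lam  # mixing weight for the residual value (hypothetical fixed scalar)

    def forward(self, x, v_first=None):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        if v_first is None:
            v_first = v  # layer 1: its own values become the shared residual
        else:
            v = self.lam * v + (1.0 - self.lam) * v_first  # residual connection to layer-1 values
        attn = F.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        return self.out_proj(attn @ v), v_first

class ResFormerStack(nn.Module):
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([ValueResidualAttention(d_model) for _ in range(n_layers)])

    def forward(self, x):
        v_first = None
        for layer in self.layers:
            out, v_first = layer(x, v_first)
            x = x + out  # usual residual stream
        return x

# Usage: a toy forward pass
x = torch.randn(2, 16, 64)  # (batch, sequence, d_model)
print(ResFormerStack(d_model=64, n_layers=4)(x).shape)
```

The SVFormer variant described in the abstract would instead reuse `v_first` directly in every later layer (dropping the per-layer value states), which is where the roughly 50% KV-cache reduction would come from.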
Related papers
- AdaSVD: Adaptive Singular Value Decomposition for Large Language Models [84.60646883395454]
Singular Value Decomposition (SVD) has emerged as a promising compression technique for large language models (LLMs).
Existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation.
We propose AdaSVD, an adaptive SVD-based LLM compression approach.
arXiv Detail & Related papers (2025-02-03T14:34:37Z) - PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation [65.36715026409873]
Key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost.
We present PrefixKV, which reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration.
Our method achieves the state-of-the-art performance compared with others.
arXiv Detail & Related papers (2024-12-04T15:48:59Z) - CR-SAM: Curvature Regularized Sharpness-Aware Minimization [8.248964912483912]
Sharpness-Aware Minimization (SAM) aims to enhance generalizability by minimizing the worst-case loss, using one-step gradient ascent as an approximation (a sketch of this basic SAM update appears after this list).
In this paper, we introduce a normalized Hessian trace to accurately measure the curvature of the loss landscape on both training and test sets.
In particular, to counter excessive non-linearity of the loss landscape, we propose Curvature Regularized SAM (CR-SAM).
arXiv Detail & Related papers (2023-12-21T03:46:29Z) - Adaptive Cross Batch Normalization for Metric Learning [75.91093210956116]
Metric learning is a fundamental problem in computer vision.
We show that it is equally important to ensure that the accumulated embeddings are up to date.
In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration.
arXiv Detail & Related papers (2023-03-30T03:22:52Z) - EcoTTA: Memory-Efficient Continual Test-time Adaptation via Self-distilled Regularization [71.70414291057332]
TTA may primarily be conducted on edge devices with limited memory.
Long-term adaptation often leads to catastrophic forgetting and error accumulation.
We present lightweight meta networks that can adapt the frozen original networks to the target domain.
arXiv Detail & Related papers (2023-03-03T13:05:30Z) - DIVISION: Memory Efficient Training via Dual Activation Precision [60.153754740511864]
State-of-the-art work combines a search over quantization bit-widths with training, which makes the procedure complicated and less transparent.
We propose a simple and effective method to compress DNN training.
Experiment results show DIVISION has better comprehensive performance than state-of-the-art methods, including over 10x compression of activation maps and competitive training throughput, without loss of model accuracy.
arXiv Detail & Related papers (2022-08-05T03:15:28Z) - Regularizing Variational Autoencoder with Diversity and Uncertainty Awareness [61.827054365139645]
Variational Autoencoder (VAE) approximates the posterior of latent variables based on amortized variational inference.
We propose an alternative model, DU-VAE, for learning a more Diverse and less Uncertain latent space.
arXiv Detail & Related papers (2021-10-24T07:58:13Z) - Impact of Channel Variation on One-Class Learning for Spoof Detection [5.549602650463701]
Spoofing detection increases the reliability of the automatic speaker verification (ASV) system but degrades significantly due to channel variation.
"Which data-feeding strategy is optimal for MCT?" is not known in the case of spoof detection.
This study highlights the relevance of the often-overlooked process of data feeding and mini-batching, raising awareness of the need to refine it for better performance.
arXiv Detail & Related papers (2021-09-30T07:56:16Z) - Layer Pruning via Fusible Residual Convolutional Block for Deep Neural Networks [15.64167076052513]
Layer pruning incurs lower inference time and runtime memory usage when the same FLOPs and number of parameters are pruned.
We propose a simple layer pruning method using a residual convolutional block (ResConv).
Our pruning method achieves excellent compression and acceleration performance over the state-of-the-art methods on different datasets.
arXiv Detail & Related papers (2020-11-29T12:51:16Z)
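For the CR-SAM entry above, the sketch below illustrates only the generic one-step gradient-ascent approximation that SAM relies on; the normalized-Hessian-trace curvature regularizer specific to CR-SAM is not reproduced. It assumes PyTorch, and the function name, the radius `rho`, and the toy model are illustrative assumptions.

```python
# Minimal sketch of SAM's one-step gradient-ascent update, assuming PyTorch.
# `rho` and the toy linear model are illustrative; CR-SAM's curvature term is omitted.
import torch
import torch.nn as nn

def sam_step(model, loss_fn, x, y, opt, rho=0.05):
    # 1) Ascent: perturb weights toward higher loss within an L2 ball of radius rho.
    loss_fn(model(x), y).backward()
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                               for p in model.parameters() if p.grad is not None))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    model.zero_grad()
    # 2) Descent: take the gradient at the perturbed point, undo the perturbation, then step.
    loss = loss_fn(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    opt.step()
    opt.zero_grad()
    return loss.item()

# Usage on a toy regression problem
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 8), torch.randn(32, 1)
print(sam_step(model, nn.MSELoss(), x, y, opt))
```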