Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
- URL: http://arxiv.org/abs/2412.16545v1
- Date: Sat, 21 Dec 2024 09:04:51 GMT
- Title: Attention Entropy is a Key Factor: An Analysis of Parallel Context Encoding with Full-attention-based Pre-trained Language Models
- Authors: Zhisong Zhang, Yan Wang, Xinting Huang, Tianqing Fang, Hongming Zhang, Chenlong Deng, Shuaiyi Li, Dong Yu
- Abstract summary: Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. An alternative is parallel context encoding, which splits the context into sub-pieces and encodes them in parallel.
- Score: 49.84163262868945
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown remarkable performance across a wide range of language tasks, owing to their exceptional capabilities in context modeling. The most commonly used method of context modeling is full self-attention, as seen in standard decoder-only Transformers. Although powerful, this method can be inefficient for long sequences and may overlook inherent input structures. To address these problems, an alternative approach is parallel context encoding, which splits the context into sub-pieces and encodes them in parallel. Because parallel patterns are not encountered during training, naively applying parallel encoding leads to performance degradation. However, the underlying reasons and potential mitigations are unclear. In this work, we provide a detailed analysis of this issue and identify that unusually high attention entropy can be a key factor. Furthermore, we adopt two straightforward methods to reduce attention entropy by incorporating attention sinks and selective mechanisms. Experiments on various tasks reveal that these methods effectively lower irregular attention entropy and narrow performance gaps. We hope this study can illuminate ways to enhance context modeling mechanisms.
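The central quantity in the abstract, per-query attention entropy, and the effect of the mitigations it mentions (an attention sink plus a selective mechanism) can be illustrated with a minimal sketch. This is a toy illustration under assumed shapes and masks, not the authors' implementation; the chunk layout and sink position below are hypothetical.

```python
# Toy sketch (not the paper's code): per-query attention entropy
# H_i = -sum_j a_ij * log(a_ij), and how restricting the attended set
# (an attention sink plus a selected chunk) lowers it. Shapes, the mask
# layout, and the sink position are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(scores, mask):
    """Entropy of each query's attention distribution.

    scores: (queries, keys) raw attention logits
    mask:   (queries, keys) boolean; False positions are never attended to
    """
    attn = softmax(np.where(mask, scores, -np.inf), axis=-1)
    logp = np.where(attn > 0, np.log(attn), 0.0)   # treat 0*log(0) as 0
    return -(attn * logp).sum(axis=-1)

rng = np.random.default_rng(0)
n_keys = 16                                  # toy context length
scores = rng.normal(size=(1, n_keys))        # one query over all keys

# Attending over every key: entropy can approach log(n_keys).
full_mask = np.ones((1, n_keys), dtype=bool)

# Sink + selection: the query sees a shared sink token (position 0) plus a
# hypothetical 4-token selected chunk, so entropy is bounded by log(5).
selective_mask = np.zeros((1, n_keys), dtype=bool)
selective_mask[0, 0] = True
selective_mask[0, 12:16] = True

print("entropy over all keys:        ", attention_entropy(scores, full_mask))
print("entropy with sink + selection:", attention_entropy(scores, selective_mask))
```

Because entropy is capped at the log of the number of attended keys, sinks and selective attention directly bound how flat the attention distribution can become, which is consistent with the abstract's observation that these methods lower irregular attention entropy.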
Related papers
- Efficient Attention Mechanisms for Large Language Models: A Survey [18.86171225316892]
Transformer-based architectures have become the prevailing computation backbone of large language models. Recent research has introduced two principal categories of efficient attention mechanisms; sparse attention techniques, for instance, limit attention to selected subsets of tokens based on fixed patterns, block-wise routing, or clustering strategies.
arXiv Detail & Related papers (2025-07-25T18:08:10Z) - Attention, Please! Revisiting Attentive Probing for Masked Image Modeling [20.39513629593113]
We introduce efficient probing (EP), a cross-attention mechanism that eliminates redundant projections, reduces the number of trainable parameters, and achieves up to a 10x speed-up over conventional multi-head attention. EP generalizes well beyond MIM to diverse pre-training paradigms, produces interpretable attention maps, and achieves strong gains in low-shot and layer-wise settings.
arXiv Detail & Related papers (2025-06-11T21:10:26Z) - Curse of High Dimensionality Issue in Transformer for Long-context Modeling [31.257769500741006]
We propose Dynamic Group Attention (DGA) to reduce redundancy by aggregating less important tokens during attention computation. Our results show that DGA significantly reduces computational costs while maintaining competitive performance.
arXiv Detail & Related papers (2025-05-28T08:34:46Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings, since it samples tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
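To make the contrast in the entry above concrete, here is a toy sketch comparing locally constrained decoding, which masks invalid tokens step by step, against plain rejection sampling over full strings. This is not the paper's adaptive weighted algorithm; the two-token vocabulary, the i.i.d. toy model, and the constraint are all assumed for illustration.

```python
# Toy contrast (not the paper's algorithm): locally constrained decoding (LCD)
# vs. plain rejection sampling under a global constraint.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["a", "b"]
P = np.array([0.9, 0.1])   # per-step model distribution (i.i.d. toy model)
LENGTH = 3

def satisfies(seq):
    """Toy global constraint: the string must contain at least one 'b'."""
    return "b" in seq

def locally_constrained(n_samples):
    """Mask tokens that make the constraint unsatisfiable, renormalize each step."""
    out = []
    for _ in range(n_samples):
        seq = []
        for t in range(LENGTH):
            last_step = t == LENGTH - 1
            # If no 'b' has appeared and this is the last step, only 'b'
            # keeps the constraint satisfiable; otherwise both tokens remain.
            allowed = np.array([not (last_step and "b" not in seq), True])
            p = P * allowed
            p = p / p.sum()
            seq.append(VOCAB[rng.choice(2, p=p)])
        out.append("".join(seq))
    return out

def rejection_sampling(n_samples):
    """Sample full strings from the unconstrained model; keep only valid ones."""
    out = []
    while len(out) < n_samples:
        seq = "".join(VOCAB[i] for i in rng.choice(2, size=LENGTH, p=P))
        if satisfies(seq):
            out.append(seq)
    return out

for name, sampler in [("LCD", locally_constrained), ("rejection", rejection_sampling)]:
    samples = sampler(20000)
    print(name, "P(ends with 'b' | constraint) ~",
          round(float(np.mean([s.endswith("b") for s in samples])), 3))
```

In this toy setup, LCD produces strings ending in 'b' about 83% of the time, whereas the true conditional probability, which rejection sampling matches, is roughly 37%: a concrete instance of the global distortion the summary mentions, traded against rejection sampling's wasted samples.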
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. It can replace the self-attention module in existing large language models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Decomposable Transformer Point Processes [2.1756081703276]
We propose a framework where the advantages of the attention-based architecture are maintained and the limitation of the thinning algorithm is circumvented.
The proposed method attains state-of-the-art performance in predicting the next event of a sequence given its history.
arXiv Detail & Related papers (2024-09-26T13:22:58Z) - Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
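The summary above describes fitting conditional probabilities via masking and then decoding with iterative parallel refinement. A generic schematic of that decode loop, in the mask-predict style with a random stand-in for the trained model, might look like the following; the confidence-based re-masking schedule and all names here are illustrative assumptions rather than the paper's exact procedure.

```python
# Schematic of iteratively-refined parallel decoding (mask-predict style).
# `predict` is a stand-in for a trained conditional model such as an adapted T5.
import numpy as np

MASK = -1          # sentinel id for a masked position
VOCAB = 32         # toy vocabulary size

def predict(tokens, rng):
    """Stand-in for the model: it ignores its input here, but a real model
    would condition on the currently unmasked tokens and return a token and
    a confidence score for every position."""
    probs = rng.dirichlet(np.ones(VOCAB), size=len(tokens))
    return probs.argmax(axis=-1), probs.max(axis=-1)

def iterative_decode(length, steps, rng):
    tokens = np.full(length, MASK)
    for t in range(steps):
        pred, conf = predict(tokens, rng)
        tokens = pred                      # commit all positions in parallel
        # Re-mask the least confident positions; the masked fraction shrinks
        # linearly to zero over the remaining passes.
        n_remask = int(length * (1 - (t + 1) / steps))
        if n_remask > 0:
            tokens[np.argsort(conf)[:n_remask]] = MASK
    return tokens

rng = np.random.default_rng(0)
print(iterative_decode(length=8, steps=4, rng=rng))
```

Each pass predicts every position in parallel and keeps only the most confident tokens, so the number of passes rather than the sequence length bounds the sequential work, which is the usual source of speedups for parallel decoding.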
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - On the Interpretability of Attention Networks [1.299941371793082]
We show how an attention model can be accurate but fail to be interpretable, and show that such models do occur as a result of training.
We evaluate a few attention model learning algorithms designed to encourage sparsity and demonstrate that these algorithms help improve interpretability.
arXiv Detail & Related papers (2022-12-30T15:31:22Z) - Object Representations as Fixed Points: Training Iterative Refinement Algorithms with Implicit Differentiation [88.14365009076907]
Iterative refinement is a useful paradigm for representation learning.
We develop an implicit differentiation approach that improves the stability and tractability of training.
arXiv Detail & Related papers (2022-07-02T10:00:35Z) - Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
Multiformer is a Transformer-based model which allows the use of different attention mechanisms on each head.
By doing this, the model is able to bias the self-attention towards the extraction of more diverse token interactions.
Results show that mixing attention patterns along the different heads and layers outperforms our baseline by up to 0.7 BLEU.
arXiv Detail & Related papers (2022-05-14T17:37:47Z) - Deep Equilibrium Assisted Block Sparse Coding of Inter-dependent Signals: Application to Hyperspectral Imaging [71.57324258813675]
A dataset of inter-dependent signals is defined as a matrix whose columns demonstrate strong dependencies.
A neural network is employed to act as structure prior and reveal the underlying signal interdependencies.
Deep unrolling and Deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures.
arXiv Detail & Related papers (2022-03-29T21:00:39Z) - Attention vs non-attention for a Shapley-based explanation method [6.386917828177479]
We consider Contextual Decomposition (CD) -- a Shapley-based input feature attribution method that has been shown to work well for recurrent NLP models.
We show that the English and Dutch models demonstrate similar processing behaviour, but that under the hood there are consistent differences between our attention and non-attention models.
arXiv Detail & Related papers (2021-04-26T09:33:18Z) - Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
With a strong auto-regressive decoder, VAEs tend to ignore their latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)