Related papers: Adaptive Large Language Models By Layerwise Attention Shortcuts

Adaptive Large Language Models By Layerwise Attention Shortcuts

URL: http://arxiv.org/abs/2409.10870v1
Date: Tue, 17 Sep 2024 03:46:01 GMT
Title: Adaptive Large Language Models By Layerwise Attention Shortcuts
Authors: Prateek Verma, Mert Pilanci,
Abstract summary: LLM-like setups allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture.
Score: 46.76681147411957
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we propose to challenge this and introduce adaptive computations for LLM-like setups, which allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism, thereby introducing computational \textbf{attention shortcuts}. These shortcuts can thus make the architecture depth and context adaptive. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture. We give evidence via attention maps that the models learn complex dependencies across layers that are adaptive in context and depth depending on the input tokens.

Related papers

Hierarchical and Step-Layer-Wise Tuning of Attention Specialty for Multi-Instance Synthesis in Diffusion Transformers [22.269573676129152]
Text-to-image (T2I) generation models often struggle with multi-instance synthesis (MIS) Traditional MIS control methods for UNet architectures fail to adapt to DiT-based models. We propose a training-free approach for enhancing MIS in DiT-based models.
arXiv Detail & Related papers (2025-04-14T11:59:58Z)
Learning and Evaluating Hierarchical Feature Representations [3.770103075126785]
We propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS) Hier-COS learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree. We demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case.
arXiv Detail & Related papers (2025-03-10T20:59:41Z)
How transformers learn structured data: insights from hierarchical filtering [2.7784685368355744]
We introduce a hierarchical filtering procedure for generative models of sequences on trees. We provide evidence that vanilla encoder-only transformers can approximate the exact inference algorithm when trained on root classification. We find clear evidence of a reconstruction of correlations on successive length scales corresponding to the various levels of the hierarchy.
arXiv Detail & Related papers (2024-08-27T15:23:09Z)
Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling [4.190836962132713]
This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms. At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its conditioned kernel on input sequence. We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
arXiv Detail & Related papers (2024-02-28T17:36:45Z)
Language Models as Hierarchy Encoders [22.03504018330068]
We introduce a novel approach to re-train transformer encoder-based LMs as Hierarchy Transformer encoders (HiTs) Our method situates the output embedding space of pre-trained LMs within a Poincar'e ball with a curvature that adapts to the embedding dimension. We evaluate HiTs against pre-trained LMs, standard fine-tuned LMs, and several hyperbolic embedding baselines.
arXiv Detail & Related papers (2024-01-21T02:29:12Z)
Equivariant Architectures for Learning in Deep Weight Spaces [54.61765488960555]
We present a novel network architecture for learning in deep weight spaces. It takes as input a concatenation of weights and biases of a pre-trainedvariant. We show how these layers can be implemented using three basic operations.
arXiv Detail & Related papers (2023-01-30T10:50:33Z)
Adaptively Clustering Neighbor Elements for Image-Text Generation [78.82346492527425]
We propose a novel Transformer-based image-to-text generation model termed as textbfACF. ACF adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments. Experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models.
arXiv Detail & Related papers (2023-01-05T08:37:36Z)
Augmenting Convolutional networks with attention-based aggregation [55.97184767391253]
We show how to augment any convolutional network with an attention-based global map to achieve non-local reasoning. We plug this learned aggregation layer with a simplistic patch-based convolutional network parametrized by 2 parameters (width and depth) It yields surprisingly competitive trade-offs between accuracy and complexity, in particular in terms of memory consumption.
arXiv Detail & Related papers (2021-12-27T14:05:41Z)
Differentiable Architecture Pruning for Transfer Learning [6.935731409563879]
We propose a gradient-based approach for extracting sub-architectures from a given large model. Our architecture-pruning scheme produces transferable new structures that can be successfully retrained to solve different tasks. We provide theoretical convergence guarantees and validate the proposed transfer-learning strategy on real data.
arXiv Detail & Related papers (2021-07-07T17:44:59Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)
Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity. Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.