Related papers: Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers

Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers

URL: http://arxiv.org/abs/2404.02684v1
Date: Wed, 3 Apr 2024 12:27:36 GMT
Title: Cross-Architecture Transfer Learning for Linear-Cost Inference Transformers
Authors: Sehyun Choi,
Abstract summary: We propose Cross-Architecture Transfer Learning (XATL) to improve efficiency of Transformer Language Models. Methodabbr significantly reduces the training time up to 2.5x times and converges to a better minimum with up to 2.6% stronger model on the LM benchmarks within the same compute budget.
Score: 1.1499643186017316
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, multiple architectures has been proposed to improve the efficiency of the Transformer Language Models through changing the design of the self-attention block to have a linear-cost inference (LCI). A notable approach in this realm is the State-Space Machines (SSMs) architecture, which showed on-par performance on language modeling tasks with the self-attention transformers. However, such an architectural change requires a full pretraining of the weights from scratch, which incurs a huge cost to researchers and practitioners who want to use the new architectures. In the more traditional linear attention works, it has been proposed to approximate full attention with linear attention by swap-and-finetune framework. Motivated by this approach, we propose Cross-Architecture Transfer Learning (XATL), in which the weights of the shared components between LCI and self-attention-based transformers, such as layernorms, MLPs, input/output embeddings, are directly transferred to the new architecture from already pre-trained model parameters. We experimented the efficacy of the method on varying sizes and alternative attention architectures and show that \methodabbr significantly reduces the training time up to 2.5x times and converges to a better minimum with up to 2.6% stronger model on the LM benchmarks within the same compute budget.

Related papers

Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
STAR: Synthesis of Tailored Architectures [61.080157488857516]
We propose a new approach for the synthesis of tailored architectures (STAR) Our approach combines a novel search space based on the theory of linear input-varying systems, supporting a hierarchical numerical encoding into architecture genomes. STAR genomes are automatically refined and recombined with gradient-free, evolutionary algorithms to optimize for multiple model quality and efficiency metrics. Using STAR, we optimize large populations of new architectures, leveraging diverse computational units and interconnection patterns, improving over highly-optimized Transformers and striped hybrid models on the frontier of quality, parameter size, and inference cache for autoregressive language modeling.
arXiv Detail & Related papers (2024-11-26T18:42:42Z)
Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs) Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
Symmetric Dot-Product Attention for Efficient Training of BERT Language Models [5.838117137253223]
We propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation.
arXiv Detail & Related papers (2024-06-10T15:24:15Z)
Activator: GLU Activation Function as the Core Component of a Vision Transformer [1.3812010983144802]
Transformer architecture currently represents the main driver behind many successes in a variety of tasks addressed by deep learning. This paper investigates substituting the attention mechanism usually adopted for transformer architecture with incorporating linear gated unit (GLU) activation within a multi-layer perceptron architecture.
arXiv Detail & Related papers (2024-05-24T21:46:52Z)
Towards smaller, faster decoder-only transformers: Architectural variants and their implications [0.0]
We introduce three modifications to the decoder-only transformer architecture, namely ParallelGPT, LinearGPT, and ConvGPT. These variants demonstrate comparable performance to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training processes.
arXiv Detail & Related papers (2024-04-22T06:19:46Z)
AGaLiTe: Approximate Gated Linear Transformers for Online Reinforcement Learning [7.886461196772644]
We introduce alternatives to the transformer self-attention mechanism that offer context-independent inference cost. Compared with a state-of-the-art architecture, GTrXL, inference in our approach is at least 40% cheaper while reducing memory use more than 50%.
arXiv Detail & Related papers (2023-10-24T10:51:50Z)
Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on matrix product operator (MPO) MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts. Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications. The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate. There has been an increased focus on making Transformer models more efficient.
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
Cross-Architecture Knowledge Distillation [32.689574589575244]
It is natural to distill complementary knowledge from Transformer to convolutional neural network (CNN) To deal with this problem, a novel cross-architecture knowledge distillation method is proposed. The proposed method outperforms 14 state-of-the-arts on both small-scale and large-scale datasets.
arXiv Detail & Related papers (2022-07-12T02:50:48Z)
Slimmable Domain Adaptation [112.19652651687402]
We introduce a simple framework, Slimmable Domain Adaptation, to improve cross-domain generalization with a weight-sharing model bank. Our framework surpasses other competing approaches by a very large margin on multiple benchmarks.
arXiv Detail & Related papers (2022-06-14T06:28:04Z)
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics. With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture. Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture. We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions. We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
arXiv Detail & Related papers (2021-06-10T15:41:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.