Related papers: DiJiang: Efficient Large Language Models through Compact Kernelization

DiJiang: Efficient Large Language Models through Compact Kernelization

URL: http://arxiv.org/abs/2403.19928v2
Date: Mon, 1 Apr 2024 09:17:01 GMT
Title: DiJiang: Efficient Large Language Models through Compact Kernelization
Authors: Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang,
Abstract summary: We present a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear complexity model with little training costs. Experiments demonstrate that the proposed method achieves comparable performance to the original Transformer, but with significantly reduced training costs and much faster inference speeds.
Score: 30.24187657746638
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In an effort to reduce the computational load of Transformers, research on linear attention has gained significant momentum. However, the improvement strategies for attention mechanisms typically necessitate extensive retraining, which is impractical for large language models with a vast array of parameters. In this paper, we present DiJiang, a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear complexity model with little training costs. By employing a weighted Quasi-Monte Carlo method for sampling, the proposed approach theoretically offers superior approximation efficiency. To further reduce the training computational complexity, our kernelization is based on Discrete Cosine Transform (DCT) operations. Extensive experiments demonstrate that the proposed method achieves comparable performance to the original Transformer, but with significantly reduced training costs and much faster inference speeds. Our DiJiang-7B achieves comparable performance with LLaMA2-7B on various benchmark while requires only about 1/50 training cost. Code is available at https://github.com/YuchuanTian/DiJiang.

Related papers

Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought [46.71030329872635]
Chain of Thought (CoT) prompting has been shown to significantly improve the performance of large language models (LLMs) We study the training dynamics of transformers over a CoT objective on an in-context weight prediction task for linear regression.
arXiv Detail & Related papers (2025-02-28T16:40:38Z)
Linearizing Large Language Models [26.94551511277412]
We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models.
arXiv Detail & Related papers (2024-05-10T17:59:08Z)
PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation. PYRA outperforms all competing methods under both low compression rate and high compression rate.
arXiv Detail & Related papers (2024-03-14T09:06:49Z)
Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [58.57026686186709]
We introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR) CFSR inherits the advantages of both convolution-based and transformer-based approaches. Experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs. We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
Towards Compute-Optimal Transfer Learning [82.88829463290041]
We argue that zero-shot structured pruning of pretrained models allows them to increase compute efficiency with minimal reduction in performance. Our results show that pruning convolutional filters of pretrained models can lead to more than 20% performance improvement in low computational regimes.
arXiv Detail & Related papers (2023-04-25T21:49:09Z)
ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language Models via Efficient Large-Batch Adversarial Noise [20.779167087445995]
Large pretrained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks. ScaLA is a novel and efficient method to accelerate the speed of transformer networks. Experiment results show that ScaLA attains 2.7-UE-9.8$times$ adaptation speedups over the baseline for GLLA on BERT-base RoBERTa-large.
arXiv Detail & Related papers (2022-01-29T01:47:01Z)
Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. A linear-complexity recurrent variant has proven well suited for autoregressive generation. This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline. While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute. We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps. This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.