MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
- URL: http://arxiv.org/abs/2411.12992v1
- Date: Wed, 20 Nov 2024 02:41:53 GMT
- Title: MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers
- Authors: Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang
- Abstract summary: We present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective.
This is made possible by utilizing an alternative method for feature transformation to replace the linear projection of fully-connected layers.
We conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
- Score: 43.39466934693055
- License:
- Abstract: In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models, such as linear attention and flash-attention. However, the model size and corresponding computational complexity are constantly scaled up in pursuit of higher performance. In this work, we present MemoryFormer, a novel transformer architecture which significantly reduces the computational complexity (FLOPs) from a new perspective. We eliminate nearly all the computations of the transformer model except for the computation required by the multi-head attention operation. This is made possible by utilizing an alternative method of feature transformation to replace the linear projection of fully-connected layers. Specifically, we first construct a group of in-memory lookup tables that store a large number of discrete vectors to replace the weight matrices used in linear projection. We then use a hash algorithm to dynamically retrieve a correlated subset of vectors based on the input embedding. The retrieved vectors are combined to form the output embedding, which provides an estimate of the result of the matrix multiplication in a fully-connected layer. Compared with matrix multiplication, retrieving data blocks from memory is a much cheaper operation that requires little computation. We train MemoryFormer from scratch and conduct extensive experiments on various benchmarks to demonstrate the effectiveness of the proposed model.
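The idea of replacing a linear projection with hash-based table lookups can be pictured with a short sketch. The snippet below is a minimal, illustrative approximation rather than the authors' implementation: it assumes a simple sign-based locality-sensitive hash over input chunks, and the names HashingMemoryLayer, chunk, tables, and planes are made up for illustration; the paper's actual hashing scheme and training procedure may differ.

```python
import torch
import torch.nn as nn


class HashingMemoryLayer(nn.Module):
    """Toy stand-in for a fully-connected layer.

    Each chunk of the input embedding is hashed to an index into its own
    lookup table of learned output vectors; the retrieved vectors are
    summed, approximating W @ x without a full matrix multiplication.
    """

    def __init__(self, d_in: int, d_out: int, chunk: int = 8):
        super().__init__()
        assert d_in % chunk == 0
        self.chunk = chunk
        self.num_tables = d_in // chunk
        # One table per chunk, holding 2**chunk candidate output vectors each.
        self.tables = nn.Parameter(
            0.02 * torch.randn(self.num_tables, 2 ** chunk, d_out)
        )
        # Fixed random hyperplanes for a sign-based (LSH-style) hash.
        self.register_buffer("planes", torch.randn(self.num_tables, chunk, chunk))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in) -> chunks: (batch, num_tables, chunk)
        chunks = x.view(x.shape[0], self.num_tables, self.chunk)
        # The sign pattern of a small projection gives a `chunk`-bit hash code.
        bits = (torch.einsum("btc,tcd->btd", chunks, self.planes) > 0).long()
        powers = 2 ** torch.arange(self.chunk, device=x.device)
        idx = (bits * powers).sum(-1)                    # (batch, num_tables)
        # Retrieve one d_out vector per table and combine by summation.
        out = self.tables[torch.arange(self.num_tables, device=x.device), idx]
        return out.sum(dim=1)                            # (batch, d_out)


# Toy usage: a lookup-based layer in place of a dense 64 -> 128 projection.
layer = HashingMemoryLayer(d_in=64, d_out=128)
y = layer(torch.randn(4, 64))
print(y.shape)  # torch.Size([4, 128])
```

In this sketch the hash indices themselves are not differentiable, so only the retrieved table vectors receive gradients; the point is simply to show how a matrix multiplication can be replaced by chunk-wise table lookups followed by a sum.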
Related papers
- SMM-Conv: Scalar Matrix Multiplication with Zero Packing for Accelerated Convolution [4.14360329494344]
We present a novel approach for accelerating convolutions during inference for CPU-based architectures.
Our experiments with commonly used network architectures demonstrate a significant speedup compared to existing indirect methods.
arXiv Detail & Related papers (2024-11-23T21:43:38Z)
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling [1.0128808054306186]
We propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method.
Our sparsification approach reduces the computational complexity and memory footprint of the Transformer during training.
SPION achieves up to a 3.08X speedup over existing state-of-the-art sparse Transformer models.
arXiv Detail & Related papers (2023-09-22T02:14:46Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
- Memory-Efficient Backpropagation through Large Linear Layers [107.20037639738433]
In modern neural networks like Transformers, linear layers require significant memory to store activations during the backward pass.
This study proposes a memory reduction approach to perform backpropagation through linear layers.
arXiv Detail & Related papers (2022-01-31T13:02:41Z)
- Sketching Transformed Matrices with Applications to Natural Language Processing [76.6222695417524]
We propose a space-efficient sketching algorithm for computing the product of a given small matrix with the transformed matrix.
We show that our approach obtains small error and is efficient in both space and time.
arXiv Detail & Related papers (2020-02-23T03:07:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.