AI and Memory Wall
- URL: http://arxiv.org/abs/2403.14123v1
- Date: Thu, 21 Mar 2024 04:31:59 GMT
- Title: AI and Memory Wall
- Authors: Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
- Abstract summary: We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
- Score: 81.06494558184049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The availability of unprecedented amounts of unsupervised training data, along with neural scaling laws, has resulted in a dramatic surge in model size and in the compute requirements for training and serving LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has been scaling at 3.0x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have scaled at only 1.6x and 1.4x every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign of model architectures, training, and deployment strategies to overcome this memory limitation.
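To make these growth rates concrete, the minimal Python sketch below (not code from the paper) compounds them over the 20-year window and then estimates the arithmetic intensity of batch-1 decoder inference; the A100-class hardware figures used at the end (312 TFLOPS fp16, ~2 TB/s DRAM bandwidth) are illustrative assumptions, not numbers taken from the paper.

```python
# Compound the growth rates quoted in the abstract over 20 years.
YEARS = 20
PERIOD = 2                      # rates are quoted per 2-year period
periods = YEARS // PERIOD       # 10 periods

rates = {
    "peak FLOPS": 3.0,              # 3.0x / 2 yrs
    "DRAM bandwidth": 1.6,          # 1.6x / 2 yrs
    "interconnect bandwidth": 1.4,  # 1.4x / 2 yrs
}
growth = {name: rate ** periods for name, rate in rates.items()}

for name, g in growth.items():
    print(f"{name:>22}: ~{g:,.0f}x over {YEARS} years")

gap_dram = growth["peak FLOPS"] / growth["DRAM bandwidth"]
gap_ic = growth["peak FLOPS"] / growth["interconnect bandwidth"]
print(f"FLOPS grew ~{gap_dram:,.0f}x faster than DRAM bandwidth")
print(f"FLOPS grew ~{gap_ic:,.0f}x faster than interconnect bandwidth")

# Why single-stream decoding hits the memory wall: generating one token
# multiplies each weight matrix by a vector, i.e. ~2 FLOPs per parameter,
# while every fp16 parameter (2 bytes) must be streamed from DRAM once.
decode_intensity = 2.0 / 2.0            # ~1 FLOP per byte moved

peak_flops = 312e12                     # assumed peak fp16 throughput, FLOP/s
mem_bw = 2.0e12                         # assumed DRAM bandwidth, bytes/s
machine_balance = peak_flops / mem_bw   # ~156 FLOPs available per byte

print(f"decode arithmetic intensity: ~{decode_intensity:.0f} FLOP/byte")
print(f"machine balance:             ~{machine_balance:.0f} FLOP/byte")
# intensity << balance  =>  decoding is bandwidth-bound, not compute-bound.
```

Compounded over 20 years, compute thus outgrows DRAM bandwidth by roughly 500x and interconnect bandwidth by roughly 2,000x, while batch-1 decoding performs only about 1 FLOP per byte moved, far below the machine balance of modern accelerators.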
Related papers
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale [20.558091867632445]
DeepSpeed Inference is a comprehensive system solution for transformer model inference.
It reduces latency by up to 7.3x over the state-of-the-art for latency-oriented scenarios and increases throughput by over 1.5x for throughput-oriented scenarios.
It can run inference on models 25x larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
arXiv Detail & Related papers (2022-06-30T18:01:08Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- A Co-design view of Compute in-Memory with Non-Volatile Elements for Neural Networks [12.042322495445196]
We discuss how compute-in-memory can play an important part in the next generation of computing hardware.
A non-volatile memory based cross-bar architecture forms the heart of an engine that uses an analog process to parallelize the matrix vector multiplication operation.
The cross-bar architecture, at times referred to as a neuromorphic approach, can be a key hardware element in future computing machines.
arXiv Detail & Related papers (2022-06-03T15:59:46Z)
- LiteTransformerSearch: Training-free On-device Search for Efficient Autoregressive Language Models [34.673688610935876]
We show that the latency-perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z)
- Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers [13.620650014358413]
Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers with access to only limited resources is that available memory capacity is small compared to model size.
arXiv Detail & Related papers (2022-02-02T22:16:27Z)
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
- Training Large Neural Networks with Constant Memory using a New Execution Algorithm [0.5424799109837065]
We introduce a new relay-style execution technique called L2L (layer-to-layer).
L2L is able to fit models of up to 50 billion parameters on a machine with a single 16GB V100 GPU and 512GB of CPU memory.
arXiv Detail & Related papers (2020-02-13T17:29:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.