Training Large Neural Networks with Constant Memory using a New
Execution Algorithm
- URL: http://arxiv.org/abs/2002.05645v5
- Date: Fri, 5 Jun 2020 03:00:26 GMT
- Title: Training Large Neural Networks with Constant Memory using a New
Execution Algorithm
- Authors: Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth
Bharadwaj
- Abstract summary: We introduce a new relay-style execution technique called L2L (layer-to-layer).
L2L is able to fit models of up to 50 billion parameters on a machine with a single 16GB V100 and 512GB of CPU memory.
- Score: 0.5424799109837065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Widely popular transformer-based NLP models such as BERT and Turing-NLG have
enormous capacity trending to billions of parameters. Current execution methods
demand brute-force resources such as HBM devices and high-speed
interconnectivity for data parallelism. In this paper, we introduce a new
relay-style execution technique called L2L (layer-to-layer) where at any given
moment, the device memory is primarily populated only with the executing
layer(s)' footprint. The model resides in DRAM attached to either a
CPU or an FPGA, an entity we call the eager param-server (EPS). To overcome the
bandwidth issues of shuttling parameters to and from EPS, the model is executed
a layer at a time across many micro-batches instead of the conventional method
of minibatches over the whole model. L2L is implemented using 16GB V100 devices for
BERT-Large, running with a device batch size of up to 256. Our results show a
45% reduction in memory and a 40% increase in throughput compared to the
state-of-the-art baseline. L2L is also able to fit models of up to 50 billion
parameters on a machine with a single 16GB V100 and 512GB of CPU memory,
without requiring any model partitioning. L2L scales to arbitrary depth,
allowing researchers to develop on affordable devices, which is a big step
toward democratizing AI. By running the optimizer in the host EPS, we show a
new form of mixed precision for faster throughput and convergence. In addition,
the EPS enables dynamic neural architecture approaches by varying layers across
iterations. Finally, we propose and demonstrate a constant-memory
variation of L2L, along with future enhancements. This work was performed
first on GPUs, but it is also targeted at all high-TFLOPS/Watt
accelerators.
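
To make the relay-style execution concrete, below is a minimal PyTorch sketch of the forward relay, under assumptions that go beyond the abstract: the EagerParamServer class, the l2l_forward function, the layer sizes, and the micro-batch handling are all illustrative and are not the authors' implementation. The whole model stays in CPU DRAM, only the executing layer is moved to the device, and every micro-batch is pushed through that layer before the next one is fetched; the backward relay, the host-side optimizer step, and the mixed-precision handling described in the abstract are omitted for brevity.

```python
import torch
import torch.nn as nn

class EagerParamServer:
    """Hypothetical host-side store: all layer weights stay in CPU DRAM."""
    def __init__(self, layers):
        self.cpu_layers = nn.ModuleList(layers)  # resides in host memory

    def fetch(self, idx, device):
        # Move only the executing layer's footprint into device memory.
        return self.cpu_layers[idx].to(device)

    def release(self, idx):
        # Return the layer's parameters to CPU DRAM, freeing device memory
        # for the next layer in the relay.
        self.cpu_layers[idx].to("cpu")

def l2l_forward(eps, batch, micro_batch_size, device):
    """Forward relay: one layer at a time, all micro-batches through it.
    (The backward relay and host-side optimizer step are omitted here.)"""
    micro_batches = list(torch.split(batch, micro_batch_size))
    for idx in range(len(eps.cpu_layers)):
        layer = eps.fetch(idx, device)
        # Every micro-batch is pushed through the single resident layer;
        # intermediate activations are parked back on the host.
        micro_batches = [layer(mb.to(device)).detach().cpu()
                         for mb in micro_batches]
        eps.release(idx)
    return torch.cat(micro_batches)

# Toy usage: a stack of layers deeper than the device would normally hold.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
eps = EagerParamServer([nn.Linear(1024, 1024) for _ in range(48)])
out = l2l_forward(eps, torch.randn(256, 1024), micro_batch_size=32, device=device)
print(out.shape)  # torch.Size([256, 1024])
```

The point the abstract hinges on is visible in the inner loop: because all micro-batches reuse the same resident layer, the cost of shuttling that layer's parameters between the EPS and the device is amortized over the whole device batch rather than paid per minibatch.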
Related papers
- FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs [0.0]
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV).
This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention computation of TNNs on field-programmable gate arrays (FPGAs).
It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency.
arXiv Detail & Related papers (2024-09-21T05:25:46Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models
for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop.
For finetuning, our methods achieve average accuracy increases of 6.3% and 24.0% on general tasks and financial tasks, respectively.
arXiv Detail & Related papers (2024-02-21T05:03:17Z) - Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z) - Distributed Inference and Fine-tuning of Large Language Models Over The
Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z) - LiteTransformerSearch: Training-free On-device Search for Efficient
Autoregressive Language Models [34.673688610935876]
We show that the latency-perplexity Pareto frontier can be found without the need for any model training.
We evaluate our method, dubbed Lightweight Transformer Search (LTS), on diverse devices.
We show that the perplexity of Transformer-XL can be achieved with up to 2x lower latency.
arXiv Detail & Related papers (2022-03-04T02:10:43Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.