Related papers: Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance

URL: http://arxiv.org/abs/2110.12484v3
Date: Tue, 2 Jul 2024 13:33:39 GMT
Title: Enabling Large Batch Size Training for DNN Models Beyond the Memory Limit While Maintaining Performance
Authors: XinYu Piao, DoangJoo Synn, JooYoung Park, Jong-Kook Kim
Abstract summary: Recent deep learning models are difficult to train using a large batch size. Machines may not have enough memory to accommodate both the model and a large data batch size. This paper proposes a method called Micro-Batch Processing (MBP) to address this problem.
Score: 0.22499166814992438
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent deep learning models are difficult to train with a large batch size because commodity machines may not have enough memory to accommodate both the model and a large batch of data. The batch size is a training hyper-parameter that is limited by the memory capacity of the target machine, because a batch must fit into the memory that remains after the model is loaded. The size of each data item also matters: the larger each item is, the smaller the batch that fits into the remaining memory. This paper proposes a method called Micro-Batch Processing (MBP) to address this problem. The method splits a batch into smaller batches that each fit in the remaining memory and processes them sequentially. After the small batches are processed individually, a loss normalization algorithm based on gradient accumulation is used to maintain performance. The purpose of our method is to allow deep learning models to train with batch sizes that exceed the memory capacity of a system, without increasing the memory size or using multiple devices (GPUs).
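
The idea in the abstract (split a batch, process the pieces sequentially, normalize the loss while accumulating gradients) can be illustrated with a minimal sketch. This is not the paper's algorithm: it is plain gradient accumulation on a toy linear model with mean squared error, and the names `mse_grad`, `micro_batch_grad`, and `micro_size` are hypothetical. It assumes each micro-batch's mean loss is weighted by its share of the full batch, so the accumulated gradient matches the full-batch gradient even when the last piece is smaller.

```python
# Sketch only: toy linear model y = w*x + b with mean squared error.
# All names here are illustrative assumptions, not taken from the paper.

def mse_grad(w, b, xs, ys):
    """Return (loss, dw, db) for the mean squared error over the samples."""
    n = len(xs)
    loss = dw = db = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y
        loss += err * err / n
        dw += 2.0 * err * x / n
        db += 2.0 * err / n
    return loss, dw, db


def micro_batch_grad(w, b, xs, ys, micro_size):
    """Accumulate loss and gradients over sequential micro-batches.

    Each micro-batch's mean loss is scaled by len(micro) / len(batch),
    so the accumulated values equal the full-batch mean.
    """
    n = len(xs)
    loss = dw = db = 0.0
    for start in range(0, n, micro_size):
        mx = xs[start:start + micro_size]
        my = ys[start:start + micro_size]
        scale = len(mx) / n  # loss normalization weight for this piece
        piece_loss, piece_dw, piece_db = mse_grad(w, b, mx, my)
        loss += scale * piece_loss
        dw += scale * piece_dw
        db += scale * piece_db
    return loss, dw, db


# Seven samples split into pieces of 3, 3, and 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
full = mse_grad(0.5, 0.0, xs, ys)
micro = micro_batch_grad(0.5, 0.0, xs, ys, micro_size=3)
```

Here `full` and `micro` agree up to floating-point rounding, while the micro-batch version only ever processes `micro_size` samples at a time. In a real framework the per-sample memory saving comes from activations being freed between pieces.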

Related papers

Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments [53.71158537264695]
Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. We introduce BitStack, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance.
arXiv Detail & Related papers (2024-10-31T13:26:11Z)
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks. We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems. Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
arXiv Detail & Related papers (2024-10-22T17:59:30Z)
CompAct: Compressed Activations for Memory-Efficient LLM Training [7.837209773889032]
CompAct is a technique that reduces peak memory utilization on GPU by 25-30% for pretraining and 50% for fine-tuning of LLMs. By storing low-rank, compressed activations to be used in the backward pass we greatly reduce the required memory. We expect CompAct's savings to scale even higher for larger models.
arXiv Detail & Related papers (2024-10-20T10:24:38Z)
Vocabulary-level Memory Efficiency for Language Model Fine-tuning [36.1039389951318]
We show that a significant proportion of the vocabulary remains unused during fine-tuning. We propose a simple yet effective approach that leverages this finding to minimize memory usage. Our approach does not impact downstream task performance, while allowing more efficient use of computational resources.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
A Model or 603 Exemplars: Towards Memory-Efficient Class-Incremental Learning [56.450090618578]
Class-Incremental Learning (CIL) aims to train a model with limited memory size to meet this requirement. We show that when counting the model size into the total budget and comparing methods with aligned memory size, saving models do not consistently work. We propose a simple yet effective baseline, denoted as MEMO for Memory-efficient Expandable MOdel.
arXiv Detail & Related papers (2022-05-26T08:24:01Z)
Memory Replay with Data Compression for Continual Learning [80.95444077825852]
We propose memory replay with data compression to reduce the storage cost of old training samples. We extensively validate this across several benchmarks of class-incremental learning and in a realistic scenario of object detection for autonomous driving.
arXiv Detail & Related papers (2022-02-14T10:26:23Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during the forward pass while storing a low-precision version of the activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory. We propose a simple training strategy called "Pseudo-to-Real" for large models with a high memory footprint.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
Semantically Constrained Memory Allocation (SCMA) for Embedding in Efficient Recommendation Systems [27.419109620575313]
A key challenge for deep learning models is to work with millions of categorical classes or tokens. We propose a novel formulation of memory shared embedding, where memory is shared in proportion to the overlap in semantic information. We demonstrate a significant reduction in the memory footprint while maintaining performance.
arXiv Detail & Related papers (2021-02-24T19:55:49Z)
Diagonal Memory Optimisation for Machine Learning on Micro-controllers [21.222568055417717]
Microcontrollers and low-power CPUs are increasingly being used to perform inference with machine learning models. The small amount of RAM available on these targets limits the size of the models that can be executed. A diagonal memory optimisation technique is described and shown to achieve memory savings of up to 34.5% when applied to eleven common models.
arXiv Detail & Related papers (2020-10-04T19:45:55Z)
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training [12.36664837965624]
This paper presents an approach to automatically shard the weight update across replicas. We show this technique achieves substantial speedups on typical image and language models on Cloud TPUs.
arXiv Detail & Related papers (2020-04-28T07:13:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.