DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP
Training
- URL: http://arxiv.org/abs/2304.08480v1
- Date: Mon, 17 Apr 2023 17:58:21 GMT
- Title: DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP
Training
- Authors: Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang
- Abstract summary: DisCo-CLIP is a memory-efficient CLIP training approach.
DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K.
- Score: 13.953918004371493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach,
to reduce the memory consumption of contrastive loss when training contrastive
learning models. Our approach decomposes the contrastive loss and its gradient
computation into two parts, one to calculate the intra-GPU gradients and the
other to compute the inter-GPU gradients. According to our decomposition, only
the intra-GPU gradients are computed on the current GPU, while the inter-GPU
gradients are collected via all_reduce from other GPUs instead of being
repeatedly computed on every GPU. In this way, we can reduce the GPU memory
consumption of contrastive loss computation from $O(B^2)$ to
$O(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of
GPUs used for training. Such a distributed solution is mathematically
equivalent to the original non-distributed contrastive loss computation,
without sacrificing any computation accuracy. It is particularly efficient for
large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive
training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64
A100 40GB GPUs, compared with the original CLIP solution which requires 128
A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code
will be released at https://github.com/IDEA-Research/DisCo-CLIP
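For intuition, here is a minimal PyTorch sketch of the decomposition described above, assuming torch.distributed is initialized with the NCCL backend, every rank holds an equal-sized sub-batch of L2-normalized features, and the model is wrapped in DistributedDataParallel so gradients are averaged across ranks. The names (GatherWithGrad, distributed_clip_loss) are illustrative; this is not the authors' released implementation from the repository above.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


class GatherWithGrad(torch.autograd.Function):
    """All-gather features from every GPU; on backward, all_reduce the
    incoming gradient and return only this rank's slice, so the inter-GPU
    gradient terms are exchanged instead of recomputed on every GPU."""

    @staticmethod
    def forward(ctx, local_feats):
        world = dist.get_world_size()
        gathered = [torch.zeros_like(local_feats) for _ in range(world)]
        dist.all_gather(gathered, local_feats)
        return torch.cat(gathered, dim=0)                    # (B, D)

    @staticmethod
    def backward(ctx, grad_all):
        grad_all = grad_all.contiguous().clone()
        dist.all_reduce(grad_all, op=dist.ReduceOp.SUM)      # collect inter-GPU grads
        rank, world = dist.get_rank(), dist.get_world_size()
        per_rank = grad_all.shape[0] // world
        return grad_all[rank * per_rank:(rank + 1) * per_rank]


def distributed_clip_loss(img_local, txt_local, temperature=0.07):
    """img_local, txt_local: (B/N, D) L2-normalized features on this rank.
    Only a (B/N, B) block of logits is materialized per GPU, which is where
    the O(B^2) -> O(B^2 / N) memory reduction comes from."""
    rank = dist.get_rank()
    b_local = img_local.shape[0]

    img_all = GatherWithGrad.apply(img_local)                # (B, D)
    txt_all = GatherWithGrad.apply(txt_local)                # (B, D)

    logits_i2t = img_local @ txt_all.t() / temperature       # (B/N, B)
    logits_t2i = txt_local @ img_all.t() / temperature       # (B/N, B)

    # Positions of this rank's pairs inside the global batch.
    targets = torch.arange(b_local, device=img_local.device) + rank * b_local
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))
```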
Related papers
- Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation [7.204881999658682]
Inference for Large Language Models (LLMs) is computationally demanding.
To reduce the cost of auto-regressive decoding, Key-Value (KV) caching is used to store intermediate activations.
The memory required for KV caching grows rapidly, often exceeding the capacity of GPU memory.
A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. (A back-of-envelope estimate of this cache growth is sketched after this list.)
arXiv Detail & Related papers (2024-11-26T04:03:14Z) - Cut Your Losses in Large-Vocabulary Language Models [102.6981011879656]
We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory.
CCE reduces the memory footprint of the loss from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. (A simplified chunked version of this idea is sketched after this list.)
arXiv Detail & Related papers (2024-11-13T20:30:15Z) - Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
We propose a tile-based strategy that partitions the contrastive loss calculation into arbitrarily small blocks.
We also introduce a multi-level tiling strategy to leverage the hierarchical structure of distributed systems.
Compared to SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed. (A simplified version of the tiled accumulation is sketched after this list.)
arXiv Detail & Related papers (2024-10-22T17:59:30Z) - Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has led to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z) - LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality.
Experimental results show that across large-scale model training frameworks like Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency. (An error-feedback compression sketch in this spirit appears after this list.)
arXiv Detail & Related papers (2024-07-05T13:01:36Z) - Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference.
arXiv Detail & Related papers (2024-03-26T11:38:39Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Data-Efficient Instance Segmentation with a Single GPU [88.31338435907304]
We introduce a data-efficient segmentation method we used in the 2021 VIPriors Instance Challenge.
Our solution is a modified version of Swin Transformer, built on the mmdetection toolbox.
Our method achieved the AP@0.50:0.95 (medium) of 0.592, which ranks second among all contestants.
arXiv Detail & Related papers (2021-10-01T07:36:20Z) - Out-of-Core GPU Gradient Boosting [0.0]
We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time.
This is the first out-of-core GPU implementation of gradient boosting.
arXiv Detail & Related papers (2020-05-19T00:41:00Z)
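Illustrative code sketches
The following Python snippets are rough illustrations of the memory-saving ideas mentioned in the list above; the names, model dimensions, and simplifications are assumptions for exposition, not code from the cited papers.

First, a back-of-envelope estimate of KV-cache growth during autoregressive decoding; the model shape used below is a hypothetical 7B-class configuration:

```python
# Estimate of why the KV cache can outgrow GPU memory during decoding.
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values are both cached per layer, head, and token -> factor of 2.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# A 7B-class model (32 layers, 32 KV heads, head_dim 128) serving 8 requests
# at a 32K context length in fp16:
gib = kv_cache_bytes(batch=8, seq_len=32_768, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 2**30
print(f"{gib:.0f} GiB of KV cache")   # 128 GiB, far beyond a single 40 GB GPU
```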
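Second, the Cut Cross-Entropy entry avoids materializing the full logit matrix. A simplified chunked variant, assuming a PyTorch setup and using activation checkpointing to recompute each chunk's logits in the backward pass (the actual CCE method uses a fused kernel); chunked_cross_entropy and _chunk_loss are illustrative names:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

def _chunk_loss(h, weight, tgt):
    # (chunk, D) @ (D, V) -> (chunk, V) logits, reduced immediately to a scalar.
    return F.cross_entropy(h @ weight.t(), tgt, reduction="sum")

def chunked_cross_entropy(hidden, classifier_weight, targets, chunk=1024):
    """hidden: (T, D), classifier_weight: (V, D), targets: (T,) int64.
    Peak extra memory is roughly (chunk x V) instead of (T x V)."""
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk):
        h = hidden[start:start + chunk]
        t = targets[start:start + chunk]
        # checkpoint() drops the chunk's logits after forward and recomputes
        # them during backward, trading compute for memory.
        total = total + checkpoint(_chunk_loss, h, classifier_weight, t,
                                   use_reentrant=False)
    return total / hidden.shape[0]
```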
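Third, the tile-based contrastive loss entry accumulates the loss over small blocks. Below is a simplified single-GPU, forward-pass version assuming L2-normalized features; in real training the per-tile computation would also be recomputed or fused in the backward pass, as the paper does with custom kernels. tiled_row_logsumexp and image_to_text_loss are illustrative names:

```python
import torch

def tiled_row_logsumexp(img, txt, temperature=0.07, tile=4096):
    """img, txt: (B, D) L2-normalized features. Returns the (B,) log-sum-exp
    over all B columns while holding at most a (B, tile) block at a time."""
    lse = torch.full((img.shape[0],), float("-inf"), device=img.device)
    for start in range(0, txt.shape[0], tile):
        block = img @ txt[start:start + tile].t() / temperature   # (B, tile)
        lse = torch.logaddexp(lse, block.logsumexp(dim=1))         # online update
    return lse

def image_to_text_loss(img, txt, temperature=0.07):
    # Cross-entropy per row = log-sum-exp over all columns minus the logit of
    # the matching (diagonal) pair.
    pos = (img * txt).sum(dim=1) / temperature
    return (tiled_row_logsumexp(img, txt, temperature) - pos).mean()
```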
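Finally, the LoCo entry compensates gradients before low-bit communication. A generic error-feedback sketch of that idea (a simplification, not LoCo's exact algorithm): the int8 payload and its scale would be exchanged between GPU nodes instead of the full-precision gradient.

```python
import torch

class ErrorFeedbackCompressor:
    """Carry the quantization error left on each GPU into the next step's
    gradient before compression, so nothing is permanently lost."""

    def __init__(self):
        self.residual = None                      # per-tensor compensation buffer

    def compress(self, grad):
        if self.residual is None:
            self.residual = torch.zeros_like(grad)
        compensated = grad + self.residual        # fold in last step's error
        scale = (compensated.abs().amax().clamp(min=1e-8) / 127.0).item()
        q = (compensated / scale).round().clamp(-127, 127).to(torch.int8)
        self.residual = compensated - q.to(grad.dtype) * scale   # new local error
        return q, scale                           # low-bit payload sent between nodes

    @staticmethod
    def decompress(q, scale, dtype=torch.float32):
        return q.to(dtype) * scale
```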