DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
- URL: http://arxiv.org/abs/2304.08480v1
- Date: Mon, 17 Apr 2023 17:58:21 GMT
- Title: DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training
- Authors: Yihao Chen, Xianbiao Qi, Jianan Wang, Lei Zhang
- Abstract summary: DisCo-CLIP is a memory-efficient CLIP training approach.
DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K.
- Score: 13.953918004371493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach,
to reduce the memory consumption of contrastive loss when training contrastive
learning models. Our approach decomposes the contrastive loss and its gradient
computation into two parts, one to calculate the intra-GPU gradients and the
other to compute the inter-GPU gradients. According to our decomposition, only
the intra-GPU gradients are computed on the current GPU, while the inter-GPU
gradients are collected via all_reduce from other GPUs instead of being
repeatedly computed on every GPU. In this way, we can reduce the GPU memory
consumption of contrastive loss computation from $\mathcal{O}(B^2)$ to
$\mathcal{O}(\frac{B^2}{N})$, where $B$ and $N$ are the batch size and the number of
GPUs used for training. Such a distributed solution is mathematically
equivalent to the original non-distributed contrastive loss computation,
without sacrificing any computation accuracy. It is particularly efficient for
large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive
training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64
A100 40GB GPUs, compared with the original CLIP solution which requires 128
A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. The code
will be released at https://github.com/IDEA-Research/DisCo-CLIP
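
Below is a minimal PyTorch sketch of this decomposition, assuming an initialized torch.distributed process group, equal per-GPU batch sizes, and L2-normalized features. The helper names (AllGatherWithGrad, disco_contrastive_loss) are ours for illustration; the released repository above is the authoritative implementation.

```python
# Hypothetical sketch of the DisCo-CLIP decomposition, not the released code:
# each GPU materializes only its (B/N) x B slice of the similarity matrix, and
# the inter-GPU gradients are collected via all_reduce in the backward pass.
import torch
import torch.distributed as dist
import torch.nn.functional as F


class AllGatherWithGrad(torch.autograd.Function):
    """All-gather features in forward; in backward, sum every GPU's partial
    gradient for the gathered tensor via all_reduce and return the slice
    belonging to this GPU (the inter-GPU part of the decomposition)."""

    @staticmethod
    def forward(ctx, feats):
        world = dist.get_world_size()
        gathered = [torch.zeros_like(feats) for _ in range(world)]
        dist.all_gather(gathered, feats)  # assumes equal local batch sizes
        return torch.cat(gathered, dim=0)

    @staticmethod
    def backward(ctx, grad_output):
        grad_output = grad_output.contiguous()
        dist.all_reduce(grad_output)  # collect inter-GPU gradients once
        rank, world = dist.get_rank(), dist.get_world_size()
        b_local = grad_output.shape[0] // world
        return grad_output[rank * b_local : (rank + 1) * b_local]


def disco_contrastive_loss(img_local, txt_local, temperature=0.07):
    """Contrastive loss over the local rows only: activation memory is
    O(B^2 / N) per GPU instead of O(B^2) for the full similarity matrix."""
    b_local, rank = img_local.shape[0], dist.get_rank()
    img_all = AllGatherWithGrad.apply(img_local)        # (B, d)
    txt_all = AllGatherWithGrad.apply(txt_local)        # (B, d)
    logits_i = img_local @ txt_all.t() / temperature    # (B/N, B) block
    logits_t = txt_local @ img_all.t() / temperature    # (B/N, B) block
    labels = torch.arange(b_local, device=img_local.device) + rank * b_local
    return 0.5 * (F.cross_entropy(logits_i, labels)
                  + F.cross_entropy(logits_t, labels))
```

Each GPU's cross_entropy averages over its B/N local rows, so averaging the per-GPU losses (as standard data-parallel gradient averaging does) recovers the mean over all B rows of the full similarity matrix.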
Related papers
- LoCo: Low-Bit Communication Adaptor for Large-scale Model Training [63.040522637816906]
Low-bit communication often degrades training quality due to compression information loss.
We propose Low-bit Communication Adaptor (LoCo), which compensates gradients on local GPU nodes before compression, without compromising training quality (a toy error-feedback sketch appears after this list).
Experimental results show that across large-scale model training frameworks such as Megatron-LM and PyTorch's FSDP, LoCo significantly improves communication efficiency.
arXiv Detail & Related papers (2024-07-05T13:01:36Z)
- LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs [4.536118764799076]
Fine-tuning pre-trained large language models with limited hardware presents challenges due to GPU memory constraints.
We introduce LLMem, a solution that estimates GPU memory consumption when applying distributed fine-tuning methods (a rough back-of-envelope estimate appears after this list).
We show that LLMem accurately estimates peak GPU memory usage on a single GPU, with error rates of up to 1.6%.
arXiv Detail & Related papers (2024-04-16T22:11:35Z)
- Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs [3.7101665559244874]
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) for the Intel Data Center GPU Max 1550.
We show with a simple model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference (a toy arithmetic-intensity calculation appears after this list).
arXiv Detail & Related papers (2024-03-26T11:38:39Z)
- DistTGL: Distributed Memory-Based Temporal Graph Neural Network Training [18.52206409432894]
DistTGL is an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
arXiv Detail & Related papers (2023-07-14T22:52:27Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z)
- Data-Efficient Instance Segmentation with a Single GPU [88.31338435907304]
We introduce a data-efficient segmentation method we used in the 2021 VIPriors Instance Challenge.
Our solution is a modified version of Swin Transformer, built on the powerful mmdetection toolbox.
Our method achieved the AP@0.50:0.95 (medium) of 0.592, which ranks second among all contestants.
arXiv Detail & Related papers (2021-10-01T07:36:20Z)
- Efficient and Generic 1D Dilated Convolution Layer for Deep Learning [52.899995651639436]
We introduce our efficient implementation of a generic 1D convolution layer covering a wide range of parameters.
It is optimized for x86 CPU architectures, in particular, for architectures containing Intel AVX-512 and AVX-512 BFloat16 instructions.
We demonstrate the performance of our optimized 1D convolution layer by utilizing it in the end-to-end neural network training with real genomics datasets.
arXiv Detail & Related papers (2021-04-16T09:54:30Z)
- Scaling Semantic Segmentation Beyond 1K Classes on a Single GPU [87.48110331544885]
We propose a novel training methodology to train and scale the existing semantic segmentation models.
We demonstrate a clear benefit of our approach on a dataset with 1284 classes, bootstrapped from LVIS and COCO annotations, with three times better mIoU than the DeeplabV3+ model.
arXiv Detail & Related papers (2020-12-14T13:12:38Z)
- Out-of-Core GPU Gradient Boosting [0.0]
We show that much larger datasets can fit on a given GPU, without degrading model accuracy or training time.
This is the first out-of-core GPU implementation of gradient boosting.
arXiv Detail & Related papers (2020-05-19T00:41:00Z)
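
The LoCo entry above describes compensating gradients on local GPU nodes before low-bit compression. Below is a toy error-feedback sketch of that general idea, with an illustrative int8 quantizer; it is our simplification, not LoCo's actual algorithm.

```python
# Illustrative error-feedback compensation before low-bit compression,
# in the spirit of the LoCo entry above (not LoCo's actual algorithm).
import torch


def compress_int8(t: torch.Tensor):
    """Toy symmetric int8 quantizer: returns the quantized tensor and scale."""
    scale = t.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((t / scale).round(), -127, 127).to(torch.int8)
    return q, scale


def decompress_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale


def compensated_compress(grad: torch.Tensor, residual: torch.Tensor):
    """Add the residual kept from the previous round to the fresh gradient,
    compress, and keep the new compression error for the next round."""
    compensated = grad + residual
    q, scale = compress_int8(compensated)
    residual = compensated - decompress_int8(q, scale)  # error feedback
    return q, scale, residual
```

Only q (8-bit) and the scale are communicated, while the residual stays local, so information lost to quantization is re-injected in later rounds rather than discarded.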
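For the LLMem entry, the sketch below gives a rough back-of-envelope fine-tuning memory estimate (weights, gradients, fp32 master weights, and Adam moments; activations omitted). It is for intuition only and is not LLMem's estimation method.

```python
# Back-of-envelope GPU memory estimate for full fine-tuning with Adam,
# for intuition only -- not LLMem's estimation method.
def finetune_memory_gib(n_params: float, dtype_bytes: int = 2,
                        optimizer_states: int = 2,
                        master_weight_bytes: int = 4) -> float:
    """Weights + gradients in `dtype_bytes`, plus an fp32 master copy and
    two fp32 Adam moments per parameter (mixed-precision training)."""
    per_param = 2 * dtype_bytes        # weights + gradients
    per_param += master_weight_bytes   # fp32 master copy
    per_param += optimizer_states * 4  # Adam m and v in fp32
    return n_params * per_param / 1024 ** 3


# Example: a 7B-parameter model needs roughly 7e9 * 16 bytes ~ 104 GiB
# before activations, which is why estimation and sharding matter.
print(f"{finetune_memory_gib(7e9):.1f} GiB")
```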
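For the fully-fused MLP entry, the toy calculation below illustrates why fusing layers raises arithmetic intensity: intermediate activations stay on-chip, so their global-memory traffic disappears. The width, depth, and byte sizes are our assumptions, not the paper's exact model.

```python
# Toy arithmetic-intensity (FLOPs per byte of global-memory traffic) model
# for a narrow MLP; assumptions ours, not the paper's exact analysis.
def mlp_arithmetic_intensity(batch, width=64, layers=4,
                             fused=True, bytes_per_elem=2):
    flops = 2 * batch * width * width * layers    # one GEMM per layer
    weight_elems = layers * width * width
    if fused:
        act_elems = batch * width * 2             # input + final output only
    else:
        act_elems = batch * width * (layers + 1)  # every intermediate too
    return flops / ((act_elems + weight_elems) * bytes_per_elem)


print(mlp_arithmetic_intensity(2**16, fused=False))  # ~51 FLOP/byte
print(mlp_arithmetic_intensity(2**16, fused=True))   # ~128 FLOP/byte
```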