Large-Scale Training System for 100-Million Classification at Alibaba
- URL: http://arxiv.org/abs/2102.06025v1
- Date: Tue, 9 Feb 2021 06:53:31 GMT
- Title: Large-Scale Training System for 100-Million Classification at Alibaba
- Authors: Liuyihan Song and Pan Pan and Kang Zhao and Hao Yang and Yiming Chen
and Yingya Zhang and Yinghui Xu and Rong Jin
- Abstract summary: Extreme classification has become an essential topic in deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
First, we build a hybrid parallel training framework to make the training process feasible.
Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs.
- Score: 43.58719630882661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent decades, extreme classification has become an essential topic in
deep learning. It has achieved great success in many areas, especially in
computer vision and natural language processing (NLP). However, it is very
challenging to train a deep model with millions of classes due to the memory
and computation explosion in the last output layer. In this paper, we propose a
large-scale training system to address these challenges. First, we build a
hybrid parallel training framework to make the training process feasible.
Second, we propose a novel softmax variation named KNN softmax, which reduces
both the GPU memory consumption and computation costs and improves the
throughput of training. Then, to eliminate the communication overhead, we
propose a new overlapping pipeline and a gradient sparsification method.
Furthermore, we design a fast continuous convergence strategy to reduce total
training iterations by adaptively adjusting learning rate and updating model
parameters. With the help of all the proposed methods, we achieve a 3.9$\times$
improvement in training throughput and reduce the number of training iterations by
almost 60\%. The experimental results show that, using an in-house cluster of 256
GPUs, we can train a classifier of 100 million classes on the Alibaba Retail
Product Dataset in about five days while achieving accuracy comparable to the
naive softmax training process.
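To make the KNN softmax idea concrete, below is a minimal single-GPU sketch of a class-subsampled softmax loss: logits are computed only for a candidate set of classes chosen by a nearest-neighbor lookup over class centers, with the ground-truth classes always kept. The function name, the cosine-similarity lookup, and k=1024 are illustrative assumptions, not the paper's hybrid-parallel implementation.

```python
# Sketch only: approximate the full softmax over C classes by scoring a small
# candidate subset selected via nearest class centers (assumed design choices).
import torch
import torch.nn.functional as F

def knn_style_softmax_loss(features, labels, class_weights, class_centers, k=1024):
    """features: (B, D); labels: (B,); class_weights, class_centers: (C, D)."""
    with torch.no_grad():
        # Nearest class centers per sample, using cosine similarity as a stand-in.
        sims = F.normalize(features, dim=1) @ F.normalize(class_centers, dim=1).t()
        knn_classes = sims.topk(k, dim=1).indices                # (B, k)
        # Union of candidates across the batch, plus every ground-truth class.
        active = torch.unique(torch.cat([knn_classes.flatten(), labels]))
    # Remap global class ids to positions inside the active subset.
    remap = torch.full((class_weights.size(0),), -1,
                       dtype=torch.long, device=labels.device)
    remap[active] = torch.arange(active.numel(), device=labels.device)
    logits = features @ class_weights[active].t()                # (B, |active|), not (B, C)
    return F.cross_entropy(logits, remap[labels])
```

In the 100-million-class regime, the full (B, C) logit matrix is what blows up GPU memory and computation; restricting the loss to a candidate subset is the general mechanism such a softmax variation exploits.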
Related papers
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
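A minimal sketch of that setup, assuming a torchvision ResNet-50 backbone and a two-layer adapter head (both illustrative; the paper's lightweight parallel network is not reproduced here):

```python
# Sketch only: train a small head on features from a frozen, pretrained
# backbone, so no gradients are backpropagated through the backbone.
import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = nn.Identity()                  # expose 2048-d pooled features
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

adapter = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 100))
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-3)

def train_step(images, labels):
    with torch.no_grad():                    # backbone is forward-only
        feats = backbone(images)
    loss = nn.functional.cross_entropy(adapter(feats), labels)
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only through the adapter
    optimizer.step()
    return loss.item()
```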
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- BLoad: Enhancing Neural Network Training with Efficient Sequential Data Handling [8.859850475075238]
We propose a novel training scheme that enables efficient distributed data-parallel training on sequences of different sizes with minimal overhead.
By using this scheme we were able to reduce the amount of padding by more than 100$\times$ without deleting a single frame, improving overall performance in both training time and recall.
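One way to picture the padding reduction (a hypothetical packing heuristic, not the BLoad scheme itself) is to pack variable-length sequences into fixed-size blocks instead of padding each one to the batch maximum:

```python
# Sketch only: greedy first-fit packing of variable-length sequences into
# fixed-size blocks, so almost no padding tokens/frames are needed.
from typing import List

def pack_sequences(lengths: List[int], block_size: int) -> List[List[int]]:
    """Group sequence indices so each group's total length fits in block_size."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    blocks, loads = [], []
    for idx in order:
        for b, load in enumerate(loads):
            if load + lengths[idx] <= block_size:   # reuse the first block with room
                blocks[b].append(idx)
                loads[b] += lengths[idx]
                break
        else:                                       # otherwise open a new block
            blocks.append([idx])
            loads.append(lengths[idx])
    return blocks

print(pack_sequences([90, 60, 40, 10], block_size=100))  # [[0, 3], [1, 2]] -> zero padding
```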
arXiv Detail & Related papers (2023-10-16T23:14:56Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline that aims to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with state-of-the-art re-param models, OREPA reduces the training-time memory cost by about 70% and accelerates training by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
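For context, the classic re-parameterization step this line of work builds on is folding a Conv+BN pair into one convolution; the sketch below shows only that step and is not OREPA's online training-time block.

```python
# Sketch only: fold a trained Conv2d + BatchNorm2d pair into a single Conv2d
# that produces identical outputs at inference time.
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)          # (C_out,)
    fused.weight.data = conv.weight.data * scale.view(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Sanity check: the fused conv matches conv -> bn in eval mode.
conv, bn = nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(fuse_conv_bn(conv, bn)(x), bn(conv(x)), atol=1e-5)
```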
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Training Efficiency and Robustness in Deep Learning [2.6451769337566406]
We study approaches to improve the training efficiency and robustness of deep learning models.
We find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data.
We show that a redundancy-aware modification to the sampling of training data improves training speed, and we develop an efficient method for detecting the diversity of the training signal.
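A minimal sketch of one way to prioritize more informative examples, sampling them in proportion to their most recent loss (the class name and the smoothing constant are assumptions, not the thesis's exact method):

```python
# Sketch only: keep a per-example score and sample examples with probability
# proportional to their most recently observed loss.
import torch

class LossPrioritizedSampler:
    def __init__(self, num_examples: int):
        self.scores = torch.ones(num_examples)       # start uniform

    def sample(self, batch_size: int) -> torch.Tensor:
        probs = self.scores / self.scores.sum()
        return torch.multinomial(probs, batch_size, replacement=False)

    def update(self, indices: torch.Tensor, losses: torch.Tensor):
        # Higher recent loss -> presumed more informative -> sampled more often.
        self.scores[indices.cpu()] = losses.detach().cpu() + 1e-3
```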
arXiv Detail & Related papers (2021-12-02T17:11:33Z)
- Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters [36.1028179125367]
Deep learning models have dominated the current landscape of production recommender systems.
Recent years have witnessed an exponential growth in model scale, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters.
However, the training of such models is challenging even within industrial scale data centers.
arXiv Detail & Related papers (2021-11-10T19:40:25Z)
- ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time in half.
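For reference, plain gradient accumulation looks like the sketch below; the paper's layered variant and its modular pipeline parallelism are not reproduced here.

```python
# Sketch only: standard gradient accumulation, one optimizer step per
# accumulation_steps micro-batches (assumes len(loader) is a multiple of it).
import torch

def train_epoch(model, loader, optimizer, accumulation_steps=8):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        (loss / accumulation_steps).backward()   # gradients accumulate in .grad
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```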
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation.
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
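A minimal sketch of that forward-only/backward-split idea, with generic feed-forward blocks standing in for transformer layers (illustrative, not the MSLT code):

```python
# Sketch only: lower layers run forward without gradient tracking, so backward
# touches only the top few layers.
import torch
import torch.nn as nn

class StagedEncoder(nn.Module):
    def __init__(self, layers: nn.ModuleList, num_trainable_top: int):
        super().__init__()
        self.layers = layers
        self.split = len(layers) - num_trainable_top

    def forward(self, x):
        with torch.no_grad():                    # forward-only for most layers
            for layer in self.layers[:self.split]:
                x = layer(x)
        for layer in self.layers[self.split:]:   # backward reaches only these layers
            x = layer(x)
        return x

# Example with simple feed-forward blocks in place of transformer layers.
blocks = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(12))
model = StagedEncoder(blocks, num_trainable_top=3)
out = model(torch.randn(8, 64))
out.sum().backward()                             # only the top 3 blocks receive gradients
```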
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.
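The recomputation half of that combination can be sketched with PyTorch's built-in activation checkpointing; the out-of-core swapping of tensors to host memory is not shown here.

```python
# Sketch only: drop intermediate activations in the forward pass and recompute
# them during backward, trading extra compute for a smaller memory footprint.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(24)])
x = torch.randn(32, 1024, requires_grad=True)

# Keep only 4 segment-boundary activations; everything in between is recomputed.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```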
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.