CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10
minutes on 1 GPU
- URL: http://arxiv.org/abs/2204.06240v1
- Date: Wed, 13 Apr 2022 08:17:15 GMT
- Title: CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10
minutes on 1 GPU
- Authors: Zangwei Zheng, Pengtai Xu, Xuan Zou, Da Tang, Zhen Li, Chenguang Xi,
Peng Wu, Leqi Zou, Yijie Zhu, Ming Chen, Xiangzhuo Ding, Fuzhao Xue, Ziheng
Qing, Youlong Cheng, Yang You
- Abstract summary: A click-through rate (CTR) prediction task is to predict whether a user will click on the recommended item.
One approach to increase the training speed is to apply large batch training.
We develop the adaptive Column-wise Clipping (CowClip) to stabilize the training process in a large batch size setting.
- Score: 14.764217935910988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The click-through rate (CTR) prediction task is to predict whether a user
will click on the recommended item. As mind-boggling amounts of data are
produced online daily, accelerating CTR prediction model training is critical
to ensuring an up-to-date model and reducing the training cost. One approach to
increase the training speed is to apply large batch training. However, as shown
in computer vision and natural language processing tasks, training with a large
batch easily suffers from the loss of accuracy. Our experiments show that
previous scaling rules fail in the training of CTR prediction neural networks.
To tackle this problem, we first theoretically show that different frequencies
of ids make it challenging to scale hyperparameters when scaling the batch
size. To stabilize the training process in a large batch size setting, we
develop the adaptive Column-wise Clipping (CowClip). It enables an easy and
effective scaling rule for the embeddings, which keeps the learning rate
unchanged and scales the L2 loss. We conduct extensive experiments with four
CTR prediction networks on two real-world datasets and successfully scaled the
batch size to 128 times the original without accuracy loss. In particular, for CTR
prediction model DeepFM training on the Criteo dataset, our optimization
framework enlarges the batch size from 1K to 128K with over 0.1% AUC
improvement and reduces training time from 12 hours to 10 minutes on a single
V100 GPU. Our code is available at https://github.com/zhengzangw/LargeBatchCTR.
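To make the clipping idea concrete, below is a minimal PyTorch sketch of per-id adaptive gradient clipping for an embedding table, in the spirit of the column-wise clipping described above. It is not the authors' implementation (see the linked LargeBatchCTR repository for that); the function name `cowclip_style_clip_` and the `clip_ratio`/`min_norm` parameters are illustrative assumptions.

```python
import torch

def cowclip_style_clip_(embedding: torch.nn.Embedding,
                        clip_ratio: float = 1.0,
                        min_norm: float = 1e-4) -> None:
    """Clip each id's embedding gradient to a threshold that adapts to the
    norm of that id's embedding row, so rarely seen ids are not overwhelmed
    by the large gradients produced in a big batch."""
    weight, grad = embedding.weight, embedding.weight.grad
    if grad is None:
        return
    w_norm = weight.detach().norm(dim=1, keepdim=True)        # per-id parameter norm
    g_norm = grad.norm(dim=1, keepdim=True).clamp_min(1e-12)  # per-id gradient norm
    # Adaptive threshold: proportional to the row norm, bounded below by min_norm.
    threshold = clip_ratio * w_norm.clamp_min(min_norm)
    # Shrink rows whose gradient norm exceeds their threshold; leave the rest unchanged.
    grad.mul_((threshold / g_norm).clamp_max(1.0))
```

A routine like this would be called between `loss.backward()` and `optimizer.step()`; per the scaling rule in the abstract, the embedding learning rate stays unchanged while the L2 regularization term is scaled with the batch size.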
Related papers
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Adaptive Low-Precision Training for Embeddings in Click-Through Rate
Prediction [36.605153166169224]
Embedding tables are usually huge in click-through rate (CTR) prediction models.
We formulate a novel quantization training paradigm to compress the embeddings from the training stage, termed low-precision training.
For the first time in CTR models, we successfully train 8-bit embeddings without sacrificing prediction accuracy.
arXiv Detail & Related papers (2022-12-12T07:19:14Z) - Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z) - On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z) - One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks [28.502489028888608]
Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs.
In adversarial training, the unlearnability of error-minimizing noise will severely degrade.
We propose a novel model-free method, named One-Pixel Shortcut, which only perturbs a single pixel of each image and makes the dataset unlearnable.
arXiv Detail & Related papers (2022-05-24T15:17:52Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to the contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via
Fine-Grained Architecture-Preserving Pruning [35.22893238058557]
Convolutional neural networks (CNNs) are becoming increasingly deeper, wider, and non-linear.
We propose ClickTrain: an efficient end-to-end training and pruning framework for CNNs.
arXiv Detail & Related papers (2020-11-20T01:46:56Z) - Predicting Training Time Without Training [120.92623395389255]
We tackle the problem of predicting the number of optimization steps that a pre-trained deep network needs to converge to a given value of the loss function.
We leverage the fact that the training dynamics of a deep network during fine-tuning are well approximated by those of a linearized model.
We are able to predict the time it takes to fine-tune a model to a given loss without having to perform any training.
arXiv Detail & Related papers (2020-08-28T04:29:54Z) - RIFLE: Backpropagation in Depth for Deep Transfer Learning through
Re-Initializing the Fully-connected LayEr [60.07531696857743]
Fine-tuning the deep convolutional neural network (CNN) using a pre-trained model helps transfer knowledge learned from larger datasets to the target task.
We propose RIFLE - a strategy that deepens backpropagation in transfer learning settings.
RIFLE brings meaningful updates to the weights of deep CNN layers and improves low-level feature learning.
arXiv Detail & Related papers (2020-07-07T11:27:43Z)