Quantized Training of Gradient Boosting Decision Trees
- URL: http://arxiv.org/abs/2207.09682v1
- Date: Wed, 20 Jul 2022 06:27:06 GMT
- Title: Quantized Training of Gradient Boosting Decision Trees
- Authors: Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu
- Abstract summary: We propose a simple yet effective way to quantize all the high-precision gradients in the GBDT training algorithm.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to a 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on a wide range of datasets.
- Score: 84.97123593657584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed significant success of Gradient Boosting
Decision Trees (GBDT) in a wide range of machine learning applications. The
general consensus in GBDT training algorithms is that gradients and statistics
are computed with high-precision floating-point numbers. In this paper, we
investigate a fundamental question that has been largely ignored by the
previous literature: how many bits are needed to represent gradients when
training GBDT? To answer this question, we propose a simple yet effective way
to quantize all the high-precision gradients in the GBDT training algorithm.
Surprisingly, both our theoretical analysis and empirical studies show that
gradients can be represented with very low precision, e.g., 2 or 3 bits,
without hurting performance. With low-precision gradients, most arithmetic
operations in GBDT training can be replaced by integer operations of 8, 16, or
32 bits. These findings pave the way for more efficient GBDT training in
several respects: (1) speeding up the computation of gradient statistics in
histograms; (2) compressing the communication cost of high-precision
statistics during distributed training; (3) motivating the use and development
of hardware architectures that support low-precision computation well.
Benchmarked on CPU, GPU, and distributed clusters, our simple quantization
strategy achieves up to a 2$\times$ speedup over SOTA GBDT systems on a wide
range of datasets, demonstrating the effectiveness and potential of
low-precision GBDT training. The code will be released to the official
repository of LightGBM.
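A minimal NumPy sketch of the core idea is given below: quantize per-sample gradients to a few bits with stochastic rounding, then accumulate histogram statistics with integer additions only. The function names, bit width, and 16-bin toy setup are illustrative assumptions, not the LightGBM implementation.

```python
import numpy as np

def quantize_gradients(grad, n_bits=3, rng=None):
    """Map float gradients to signed low-bit integers with stochastic rounding.

    Stochastic rounding keeps the quantized gradients unbiased in expectation,
    which is what allows precision as low as 2 or 3 bits.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    max_int = 2 ** (n_bits - 1) - 1            # e.g. 3 for 3-bit signed values
    scale = np.max(np.abs(grad)) / max_int     # one shared scale for all samples
    scaled = grad / scale
    low = np.floor(scaled)
    # round up with probability equal to the fractional part
    q = low + (rng.random(grad.shape) < (scaled - low))
    return q.astype(np.int8), scale

def histogram_from_quantized(bin_ids, q_grad, n_bins):
    """Accumulate per-bin gradient sums using integer additions only."""
    hist = np.zeros(n_bins, dtype=np.int32)    # a 32-bit accumulator suffices
    np.add.at(hist, bin_ids, q_grad.astype(np.int32))
    return hist

# toy usage: 1000 samples, one feature discretized into 16 histogram bins
rng = np.random.default_rng(1)
grad = rng.normal(size=1000).astype(np.float32)
bin_ids = rng.integers(0, 16, size=1000)

q_grad, scale = quantize_gradients(grad)
hist_int = histogram_from_quantized(bin_ids, q_grad, n_bins=16)
hist_float = np.array([grad[bin_ids == b].sum() for b in range(16)])
# the dequantized integer histogram approximates the exact float histogram
print(np.max(np.abs(hist_int * scale - hist_float)))
```

Because stochastic rounding is unbiased, the dequantized integer histogram matches the float histogram in expectation, which is what makes such low-bit gradients workable for split finding.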
Related papers
- Gradient-Free Neural Network Training on the Edge [12.472204825917629]
Training neural networks is computationally heavy and energy-intensive.
This work presents a novel technique for training neural networks without needing gradients.
We show that it is possible to train models without gradient-based optimization by identifying the erroneous contributions of each neuron towards the expected classification.
arXiv Detail & Related papers (2024-10-13T05:38:39Z)
- Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance [51.36243421001282]
Gradient-Mask Tuning (GMT) is a method that selectively updates parameters during training based on their gradient information.
Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance.
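As a rough illustration of gradient-magnitude masking, the sketch below applies an SGD step only to the parameters whose gradients are largest in magnitude; the quantile threshold and keep ratio are assumptions, not GMT's exact selection rule.

```python
import numpy as np

def gradient_mask_update(params, grads, lr=1e-3, keep_ratio=0.2):
    """SGD step applied only where the gradient magnitude is in the top 20%.

    Illustrative only: the keep ratio and per-tensor quantile threshold are
    assumptions, not the exact selection rule of GMT.
    """
    threshold = np.quantile(np.abs(grads), 1.0 - keep_ratio)
    mask = np.abs(grads) >= threshold          # 1 where the update is kept
    return params - lr * grads * mask

# toy usage on a single weight matrix
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
g = rng.normal(size=(4, 4))
w = gradient_mask_update(w, g)
```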
arXiv Detail & Related papers (2024-06-21T17:42:52Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
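The sketch below shows plain column-row sampling, the classical unbiased estimator of a matrix product that WTA-CRS builds on; the winner-take-all modification from the paper is not reproduced here.

```python
import numpy as np

def crs_matmul(A, B, c, rng=None):
    """Unbiased column-row-sampling (CRS) estimate of A @ B from c sampled terms.

    Plain CRS is shown; WTA-CRS changes the sampling rule to reduce variance
    further, which is not reproduced here.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = A.shape[1]
    # sampling columns/rows proportionally to their norms minimizes the variance
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=c, replace=True, p=p)
    # each sampled outer product is rescaled so the expectation equals A @ B
    return sum(np.outer(A[:, i], B[i, :]) / (c * p[i]) for i in idx)

# toy check: the estimate approaches A @ B as c grows
rng = np.random.default_rng(1)
A, B = rng.normal(size=(8, 64)), rng.normal(size=(64, 8))
print(np.linalg.norm(crs_matmul(A, B, c=32) - A @ B))
```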
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems [3.04585143845864]
Gradient Boosted Decision Tree (GBDT) is a widely-used machine learning algorithm.
We propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario.
Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by over 40 times in some cases.
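One common way to sketch a multioutput gradient matrix is a random Gaussian projection of the output dimension, illustrated below; this is only one of several possible sketching strategies, and the function name and sizes are assumptions rather than SketchBoost's API.

```python
import numpy as np

def sketch_gradients(grad_matrix, k, rng=None):
    """Project an (n_samples, n_outputs) gradient matrix down to k columns.

    A random Gaussian projection is only one of several sketching strategies;
    the function name and sizes below are illustrative assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_outputs = grad_matrix.shape[1]
    proj = rng.normal(size=(n_outputs, k)) / np.sqrt(k)
    # split search can now scale with k instead of n_outputs
    return grad_matrix @ proj

# toy usage: 1000 samples with 100 outputs sketched down to 5 columns
g = np.random.default_rng(1).normal(size=(1000, 100))
print(sketch_gradients(g, k=5).shape)   # (1000, 5)
```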
arXiv Detail & Related papers (2022-11-23T11:06:10Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment [0.6091702876917281]
Gradient sparsification has been proposed to significantly reduce communication traffic.
Top-k gradient sparsification (Top-k SGD) has limited ability to speed up overall training performance.
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into the causes of its low performance.
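For reference, the sketch below shows the basic Top-k compression step each worker would apply before communication; error feedback and the all-gather logic of a full distributed system are omitted.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Each worker would send (indices, values) instead of the dense gradient;
    the error feedback used by practical Top-k SGD systems is omitted here.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the top-k entries
    return idx, grad[idx]

# toy usage: compress a 1M-element gradient to its 1000 largest entries
g = np.random.default_rng(0).normal(size=1_000_000)
idx, vals = top_k_sparsify(g, k=1000)
print(idx.shape, vals.shape)               # (1000,) (1000,)
```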
arXiv Detail & Related papers (2022-09-18T07:42:31Z)
- Distribution Adaptive INT8 Quantization for Training CNNs [12.708068468737286]
In this paper, we propose a novel INT8 quantization training framework for convolutional neural networks.
Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension.
Then, a Magnitude-aware Clipping Strategy is introduced that takes the magnitudes of gradients into account when minimizing the quantization error.
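A rough per-channel INT8 quantization sketch is shown below; the fixed 99.9th-percentile clip is a simple stand-in for the paper's magnitude-aware clipping, which adapts the clipping value to the gradient distribution.

```python
import numpy as np

def per_channel_int8(grad, clip_pct=99.9):
    """Quantize a (channels, ...) gradient tensor to INT8, one scale per channel.

    The fixed 99.9th-percentile clip is an assumption standing in for
    magnitude-aware clipping, which adapts to the gradient distribution.
    """
    flat = grad.reshape(grad.shape[0], -1)
    clip = np.percentile(np.abs(flat), clip_pct, axis=1, keepdims=True)
    scale = clip / 127.0                        # map the clip value to INT8 range
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q.reshape(grad.shape), scale

# toy usage: gradients of a conv layer with 64 output channels
g = np.random.default_rng(0).normal(size=(64, 3, 3, 3)).astype(np.float32)
q, scale = per_channel_int8(g)
print(q.dtype, scale.shape)                     # int8 (64, 1)
```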
arXiv Detail & Related papers (2021-02-09T11:58:10Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)