Quantized Training of Gradient Boosting Decision Trees
- URL: http://arxiv.org/abs/2207.09682v1
- Date: Wed, 20 Jul 2022 06:27:06 GMT
- Title: Quantized Training of Gradient Boosting Decision Trees
- Authors: Yu Shi, Guolin Ke, Zhuoming Chen, Shuxin Zheng, Tie-Yan Liu
- Abstract summary: We propose a simple yet effective way to quantize all the high-precision gradients in the GBDT training algorithm.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to a 2$\times$ speedup of our simple quantization strategy compared with SOTA GBDT systems on a wide range of datasets.
- Score: 84.97123593657584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed significant success of Gradient Boosting
Decision Trees (GBDT) in a wide range of machine learning applications. The
general consensus in GBDT training algorithms is that gradients and statistics
are computed with high-precision floating-point numbers. In this paper, we
investigate a fundamental question that has been largely ignored by the
previous literature: how many bits are needed to represent gradients when
training GBDT? To answer this question, we propose a simple yet effective way
to quantize all the high-precision gradients in the GBDT training algorithm.
Surprisingly, both our theoretical analysis and empirical studies show that
gradients can be represented with very low precision, e.g., 2 or 3 bits,
without hurting performance. With low-precision gradients, most arithmetic
operations in GBDT training can be replaced by integer operations of 8, 16, or
32 bits. These findings pave the way for more efficient GBDT training in
several respects: (1) speeding up the computation of gradient statistics in
histograms; (2) compressing the communication cost of high-precision
statistics during distributed training; (3) motivating the use and development
of hardware architectures that support low-precision computation well.
Benchmarked on CPU, GPU, and distributed clusters, our simple quantization
strategy achieves up to a 2$\times$ speedup over SOTA GBDT systems on a wide
range of datasets, demonstrating the effectiveness and potential of
low-precision GBDT training. The code will be released to the official
repository of LightGBM.
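A minimal NumPy sketch of the core idea is given below: quantize per-sample gradients to a few bits with stochastic rounding, then accumulate histogram statistics with integer additions only. The function names, bit width, and 16-bin toy setup are illustrative assumptions, not the LightGBM implementation.

```python
import numpy as np

def quantize_gradients(grad, n_bits=3, rng=None):
    """Map float gradients to signed low-bit integers with stochastic rounding.

    Stochastic rounding keeps the quantized gradients unbiased in expectation,
    which is what allows precision as low as 2 or 3 bits.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    max_int = 2 ** (n_bits - 1) - 1            # e.g. 3 for 3-bit signed values
    scale = np.max(np.abs(grad)) / max_int     # one shared scale for all samples
    scaled = grad / scale
    low = np.floor(scaled)
    # round up with probability equal to the fractional part
    q = low + (rng.random(grad.shape) < (scaled - low))
    return q.astype(np.int8), scale

def histogram_from_quantized(bin_ids, q_grad, n_bins):
    """Accumulate per-bin gradient sums using integer additions only."""
    hist = np.zeros(n_bins, dtype=np.int32)    # a 32-bit accumulator suffices
    np.add.at(hist, bin_ids, q_grad.astype(np.int32))
    return hist

# toy usage: 1000 samples, one feature discretized into 16 histogram bins
rng = np.random.default_rng(1)
grad = rng.normal(size=1000).astype(np.float32)
bin_ids = rng.integers(0, 16, size=1000)

q_grad, scale = quantize_gradients(grad)
hist_int = histogram_from_quantized(bin_ids, q_grad, n_bins=16)
hist_float = np.array([grad[bin_ids == b].sum() for b in range(16)])
# the dequantized integer histogram approximates the exact float histogram
print(np.max(np.abs(hist_int * scale - hist_float)))
```

Because stochastic rounding is unbiased, the dequantized integer histogram matches the float histogram in expectation, which is what makes such low-bit gradients workable for split finding.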
Related papers
- Gradient-Free Neural Network Training on the Edge [12.472204825917629]
Training neural networks is computationally heavy and energy-intensive.
This work presents a novel technique for training neural networks without needing gradients.
We show that it is possible to train models without gradient-based optimization by identifying the erroneous contributions of each neuron towards the expected classification.
arXiv Detail & Related papers (2024-10-13T05:38:39Z)
- Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance [51.36243421001282]
Gradient-Mask Tuning (GMT) is a method that selectively updates parameters during training based on their gradient information.
Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance.
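As a rough illustration of gradient-magnitude masking, the sketch below applies an SGD step only to the parameters whose gradients are largest in magnitude; the quantile threshold and keep ratio are assumptions, not GMT's exact selection rule.

```python
import numpy as np

def gradient_mask_update(params, grads, lr=1e-3, keep_ratio=0.2):
    """SGD step applied only where the gradient magnitude is in the top 20%.

    Illustrative only: the keep ratio and per-tensor quantile threshold are
    assumptions, not the exact selection rule of GMT.
    """
    threshold = np.quantile(np.abs(grads), 1.0 - keep_ratio)
    mask = np.abs(grads) >= threshold          # 1 where the update is kept
    return params - lr * grads * mask

# toy usage on a single weight matrix
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))
g = rng.normal(size=(4, 4))
w = gradient_mask_update(w, g)
```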
arXiv Detail & Related papers (2024-06-21T17:42:52Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
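The sketch below shows plain column-row sampling, the classical unbiased estimator of a matrix product that WTA-CRS builds on; the winner-take-all modification from the paper is not reproduced here.

```python
import numpy as np

def crs_matmul(A, B, c, rng=None):
    """Unbiased column-row-sampling (CRS) estimate of A @ B from c sampled terms.

    Plain CRS is shown; WTA-CRS changes the sampling rule to reduce variance
    further, which is not reproduced here.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = A.shape[1]
    # sampling columns/rows proportionally to their norms minimizes the variance
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=c, replace=True, p=p)
    # each sampled outer product is rescaled so the expectation equals A @ B
    return sum(np.outer(A[:, i], B[i, :]) / (c * p[i]) for i in idx)

# toy check: the estimate approaches A @ B as c grows
rng = np.random.default_rng(1)
A, B = rng.normal(size=(8, 64)), rng.normal(size=(64, 8))
print(np.linalg.norm(crs_matmul(A, B, c=32) - A @ B))
```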
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- SketchBoost: Fast Gradient Boosted Decision Tree for Multioutput Problems [3.04585143845864]
Gradient Boosted Decision Tree (GBDT) is a widely-used machine learning algorithm.
We propose novel methods aiming to accelerate the training process of GBDT in the multioutput scenario.
Our numerical study demonstrates that SketchBoost speeds up the training process of GBDT by over 40 times in some cases.
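One common way to sketch a multioutput gradient matrix is a random Gaussian projection of the output dimension, illustrated below; this is only one of several possible sketching strategies, and the function name and sizes are assumptions rather than SketchBoost's API.

```python
import numpy as np

def sketch_gradients(grad_matrix, k, rng=None):
    """Project an (n_samples, n_outputs) gradient matrix down to k columns.

    A random Gaussian projection is only one of several sketching strategies;
    the function name and sizes below are illustrative assumptions.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n_outputs = grad_matrix.shape[1]
    proj = rng.normal(size=(n_outputs, k)) / np.sqrt(k)
    # split search can now scale with k instead of n_outputs
    return grad_matrix @ proj

# toy usage: 1000 samples with 100 outputs sketched down to 5 columns
g = np.random.default_rng(1).normal(size=(1000, 100))
print(sketch_gradients(g, k=5).shape)   # (1000, 5)
```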
arXiv Detail & Related papers (2022-11-23T11:06:10Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment [0.6091702876917281]
Gradient sparsification has been proposed to significantly reduce communication traffic.
Top-k gradient sparsification (Top-k SGD) has limited ability to speed up overall training performance.
We conduct experiments that show the inefficiency of Top-k SGD and provide insight into the causes of its low performance.
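For reference, the sketch below shows the basic Top-k compression step each worker would apply before communication; error feedback and the all-gather logic of a full distributed system are omitted.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Each worker would send (indices, values) instead of the dense gradient;
    the error feedback used by practical Top-k SGD systems is omitted here.
    """
    idx = np.argpartition(np.abs(grad), -k)[-k:]   # indices of the top-k entries
    return idx, grad[idx]

# toy usage: compress a 1M-element gradient to its 1000 largest entries
g = np.random.default_rng(0).normal(size=1_000_000)
idx, vals = top_k_sparsify(g, k=1000)
print(idx.shape, vals.shape)               # (1000,) (1000,)
```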
arXiv Detail & Related papers (2022-09-18T07:42:31Z)
- Distribution Adaptive INT8 Quantization for Training CNNs [12.708068468737286]
In this paper, we propose a novel INT8 quantization training framework for convolutional neural networks.
Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension.
Then, a Magnitude-aware Clipping Strategy is introduced that takes the magnitudes of gradients into account when minimizing the quantization error.
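A rough per-channel INT8 quantization sketch is shown below; the fixed 99.9th-percentile clip is a simple stand-in for the paper's magnitude-aware clipping, which adapts the clipping value to the gradient distribution.

```python
import numpy as np

def per_channel_int8(grad, clip_pct=99.9):
    """Quantize a (channels, ...) gradient tensor to INT8, one scale per channel.

    The fixed 99.9th-percentile clip is an assumption standing in for
    magnitude-aware clipping, which adapts to the gradient distribution.
    """
    flat = grad.reshape(grad.shape[0], -1)
    clip = np.percentile(np.abs(flat), clip_pct, axis=1, keepdims=True)
    scale = clip / 127.0                        # map the clip value to INT8 range
    q = np.clip(np.round(flat / scale), -127, 127).astype(np.int8)
    return q.reshape(grad.shape), scale

# toy usage: gradients of a conv layer with 64 output channels
g = np.random.default_rng(0).normal(size=(64, 3, 3, 3)).astype(np.float32)
q, scale = per_channel_int8(g)
print(q.dtype, scale.shape)                     # int8 (64, 1)
```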
arXiv Detail & Related papers (2021-02-09T11:58:10Z)
- Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z)