High-Order Optimization of Gradient Boosted Decision Trees
- URL: http://arxiv.org/abs/2211.11367v1
- Date: Mon, 21 Nov 2022 11:33:16 GMT
- Title: High-Order Optimization of Gradient Boosted Decision Trees
- Authors: Jean Pachebat, Sergei Ivanov
- Abstract summary: We introduce high-order optimization for GBDTs based on numerical optimization theory.
We show that high-order optimization achieves faster per-iteration convergence, which leads to reduced running time.
- Score: 1.4047579643483785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gradient Boosted Decision Trees (GBDTs) are dominant machine learning
algorithms for modeling discrete or tabular data. Unlike neural networks with
millions of trainable parameters, GBDTs optimize the loss function in an additive
manner and have a single trainable parameter per leaf, which makes it easy to
apply high-order optimization of the loss function. In this paper, we introduce
high-order optimization for GBDTs based on numerical optimization theory, which
allows us to construct trees based on high-order derivatives of a given loss
function. In the experiments, we show that high-order optimization has faster
per-iteration convergence, which leads to reduced running time. Our solution can
be easily parallelized and run on GPUs with little code overhead.
Finally, we discuss potential future improvements such as automatic
differentiation of arbitrary loss functions and the combination of GBDTs with neural
networks.
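To make the mechanism concrete, the sketch below contrasts the standard second-order (Newton) leaf-value update used in XGBoost-style GBDTs with a third-order step derived from Halley's root-finding method, for the logistic loss. The choice of Halley's method, the regularization term lam, and all function names are illustrative assumptions rather than the authors' exact scheme.

```python
# Minimal sketch (not the paper's implementation): second- vs. third-order
# leaf-value updates for a single GBDT leaf under the logistic loss.
import numpy as np

def logistic_derivatives(y, f):
    """First, second and third derivatives of the logistic loss w.r.t. the raw score f."""
    p = 1.0 / (1.0 + np.exp(-f))     # predicted probability
    g = p - y                        # first derivative (gradient)
    h = p * (1.0 - p)                # second derivative (diagonal Hessian)
    t = h * (1.0 - 2.0 * p)          # third derivative
    return g, h, t

def leaf_values(y, f, lam=1.0):
    """Leaf weight for the samples routed to one leaf (lam is an assumed L2 term)."""
    g, h, t = logistic_derivatives(y, f)
    G, H, T = g.sum(), h.sum() + lam, t.sum()
    w_newton = -G / H                                   # standard second-order step
    w_halley = -2.0 * G * H / (2.0 * H ** 2 - G * T)    # illustrative third-order (Halley) step
    return w_newton, w_halley

y = np.array([1.0, 0.0, 1.0, 1.0])   # labels of the samples in the leaf
f = np.zeros_like(y)                 # current raw predictions
print(leaf_values(y, f))
```

In this toy example the two updates coincide because the third derivative vanishes at p = 0.5; for leaves with skewed predictions the third-order step differs and may land closer to the per-leaf loss minimizer in a single iteration.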
Related papers
- Gradient-free neural topology optimization [0.0]
Gradient-free algorithms require many more iterations to converge than gradient-based algorithms.
This has made them unviable for topology optimization due to the high computational cost per iteration and high dimensionality of these problems.
We propose a pre-trained neural reparameterization strategy that leads to at least one order of magnitude decrease in iteration count when optimizing the designs in latent space.
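Read literally, "optimizing the designs in latent space" amounts to running a gradient-free search over the latent code of a pre-trained decoder instead of over the raw design grid. The toy sketch below uses a random linear decoder, a dummy objective, and a (1+1) evolution strategy purely as stand-ins; none of these are the paper's actual components.

```python
# Toy sketch (assumed components): gradient-free search over the latent code z
# of a pre-trained decoder instead of the full design grid.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, DESIGN_DIM = 8, 64 * 64

# Stand-in for a pre-trained neural reparameterization: here just a fixed linear map.
W = rng.normal(size=(DESIGN_DIM, LATENT_DIM)) / np.sqrt(LATENT_DIM)
decode = lambda z: 1.0 / (1.0 + np.exp(-W @ z))      # design densities in (0, 1)

def objective(design):
    """Dummy stand-in for compliance plus a volume penalty."""
    return np.sum((design - 0.3) ** 2)

# Simple (1+1) evolution strategy in latent space: far fewer dimensions to search.
z_best = rng.normal(size=LATENT_DIM)
f_best = objective(decode(z_best))
for _ in range(200):
    z_cand = z_best + 0.1 * rng.normal(size=LATENT_DIM)
    f_cand = objective(decode(z_cand))
    if f_cand < f_best:
        z_best, f_best = z_cand, f_cand
print(f_best)
```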
arXiv Detail & Related papers (2024-03-07T23:00:49Z)
- SGD with Partial Hessian for Deep Neural Networks Optimization [18.78728272603732]
We propose a compound optimizer, which combines a second-order method with a precise partial Hessian matrix for updating channel-wise parameters and first-order stochastic gradient descent (SGD) for updating the other parameters.
Compared with first-order optimizers, it uses a certain amount of information from the Hessian matrix to assist optimization; compared with existing second-order optimizers, it retains the good generalization performance of first-order methods.
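A minimal sketch of the compound idea described above, under assumed update rules: channel-wise parameters receive a damped diagonal second-order step while the remaining parameters follow plain SGD. The diagonal-Hessian form, the damping constant, and the parameter split are illustrative, not the paper's exact method.

```python
# Minimal sketch (assumed update rules): second-order steps for channel-wise
# parameters, plain SGD for everything else.
import numpy as np

def compound_step(params, grads, hess_diag, channelwise, lr=0.1, damping=1e-3):
    """params/grads/hess_diag: dict name -> array; channelwise: set of parameter names."""
    new_params = {}
    for name, p in params.items():
        g = grads[name]
        if name in channelwise:
            # Damped diagonal-Hessian (Newton-like) step for channel-wise parameters.
            new_params[name] = p - g / (hess_diag[name] + damping)
        else:
            # First-order SGD step for the other parameters.
            new_params[name] = p - lr * g
    return new_params

params = {"conv.weight": np.ones((4, 3)), "bn.gamma": np.ones(4)}
grads  = {"conv.weight": 0.1 * np.ones((4, 3)), "bn.gamma": 0.5 * np.ones(4)}
hess   = {"bn.gamma": 2.0 * np.ones(4)}
print(compound_step(params, grads, hess, channelwise={"bn.gamma"}))
```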
arXiv Detail & Related papers (2024-03-05T06:10:21Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- ELRA: Exponential learning rate adaption gradient descent optimization method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyperparameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
arXiv Detail & Related papers (2023-09-12T14:36:13Z)
- Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which meta-learns the best preconditioner online during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our learned preconditioner approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
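One way to picture online meta-learning of a preconditioner is to keep a matrix P, step with theta <- theta - P g, and nudge P along the hypergradient of the post-step loss. The quadratic test problem and the rank-one hypergradient update below are illustrative assumptions, not LODO's neural-network parameterization.

```python
# Illustrative sketch (not LODO's parameterization): online meta-learning of a
# dense preconditioner P by following the hypergradient of the post-step loss.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
A = A @ A.T / 5.0 + np.eye(5)          # quadratic test loss 0.5 * x^T A x
grad = lambda x: A @ x

x = rng.normal(size=5)
P = 0.1 * np.eye(5)                    # learned preconditioner, starts as a small SGD step
meta_lr = 1e-3

for _ in range(500):
    g = grad(x)
    x_new = x - P @ g                  # preconditioned parameter step
    g_new = grad(x_new)
    # Hypergradient of loss(x - P g) w.r.t. P is -outer(g_new, g); descend along it.
    P += meta_lr * np.outer(g_new, g)
    x = x_new

print(np.linalg.norm(grad(x)))         # gradient norm should be small after adaptation
```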
arXiv Detail & Related papers (2022-10-11T03:47:14Z)
- Alternating Differentiation for Optimization Layers [133.2668019610731]
We develop a new framework, named Alternating Differentiation (Alt-Diff), that differentiates optimization problems.
We show that Alt-Diff substantially decreases the dimensions of the Jacobian matrix especially for optimization with large-scale constraints.
We also propose to truncate Alt-Diff to further accelerate the computational speed.
arXiv Detail & Related papers (2022-10-03T11:32:13Z)
- Joint inference and input optimization in equilibrium networks [68.63726855991052]
The deep equilibrium model is a class of models that foregoes traditional network depth and instead computes the output of a network by finding the fixed point of a single nonlinear layer.
We show that there is a natural synergy between these two settings, i.e., inference over the network and optimization over its inputs.
We demonstrate this strategy on various tasks such as training generative models while optimizing over latent codes, training models for inverse problems like denoising and inpainting, adversarial training and gradient based meta-learning.
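A minimal sketch of the forward pass described above, under assumed shapes and a hand-scaled layer: the output is the fixed point z* = tanh(W z* + U x + b) of a single nonlinear layer, found here by plain iteration (practical deep equilibrium models use faster root-finders and implicit differentiation).

```python
# Minimal sketch (assumed layer and shapes): a deep equilibrium forward pass that
# returns the fixed point of one nonlinear layer instead of stacking many layers.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 3, 16
W = rng.normal(size=(d_hidden, d_hidden))
W /= 2.0 * np.linalg.norm(W, 2)        # scale so the map is contractive and iteration converges
U = rng.normal(size=(d_hidden, d_in))
b = np.zeros(d_hidden)

def deq_forward(x, tol=1e-8, max_iter=500):
    z = np.zeros(d_hidden)
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ x + b)   # the single layer f(z, x)
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z                                  # z ~= f(z, x): the network output

x = rng.normal(size=d_in)
print(deq_forward(x)[:4])
```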
arXiv Detail & Related papers (2021-11-25T19:59:33Z)
- Learned Optimizers for Analytic Continuation [0.0]
We propose a neural network method based on convex optimization that replaces the ill-posed inverse problem with a sequence of well-conditioned problems.
After training, the learned surrogates are able to give a solution of high quality with low time cost.
Our methods may be easily extended to other high-dimensional inverse problems via large-scale pretraining.
arXiv Detail & Related papers (2021-07-28T10:57:32Z)
- Implicit differentiation for fast hyperparameter selection in non-smooth convex learning [87.60600646105696]
We study first-order methods when the inner optimization problem is convex but non-smooth.
We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian.
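To make that statement concrete, the sketch below runs proximal gradient descent (ISTA) on a Lasso problem and propagates, in forward mode, the derivative of the iterate with respect to the regularization parameter lam; the problem sizes and step size are arbitrary illustrative choices.

```python
# Concrete sketch (illustrative sizes and step): forward-mode differentiation of
# proximal gradient descent (ISTA) for the Lasso, propagating d(beta)/d(lam).
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)
lam = 1.0
step = 1.0 / np.linalg.norm(X, 2) ** 2           # 1 / Lipschitz constant of the smooth part

beta = np.zeros(p)                               # primal iterate
jac = np.zeros(p)                                # forward-mode derivative d(beta)/d(lam)

for _ in range(2000):
    z = beta - step * X.T @ (X @ beta - y)       # gradient step on the smooth part
    dz = jac - step * X.T @ (X @ jac)            # its directional derivative w.r.t. lam
    mask = np.abs(z) > step * lam                # support kept by the soft-thresholding prox
    beta = np.where(mask, z - np.sign(z) * step * lam, 0.0)
    jac = np.where(mask, dz - np.sign(z) * step, 0.0)   # chain rule through the prox

print(beta, jac)   # jac converges toward the Jacobian of the Lasso solution w.r.t. lam
```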
arXiv Detail & Related papers (2021-05-04T17:31:28Z)
- Divide and Learn: A Divide and Conquer Approach for Predict+Optimize [50.03608569227359]
The predict+optimize problem combines machine learning of problem coefficients with an optimization problem that uses the predicted coefficients.
We show how to directly express the loss of the optimization problem in terms of the predicted coefficients as a piece-wise linear function.
We propose a novel divide and conquer algorithm to tackle optimization problems without this restriction and predict their coefficients using the optimization loss.
arXiv Detail & Related papers (2020-12-04T00:26:56Z)
- A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks [23.038631072178735]
We consider a broad class of optimization algorithms that are commonly used in practice.
As a consequence, we can leverage the convergence behavior of neural networks.
We believe our approach can also be extended to other optimization algorithms and network theory.
arXiv Detail & Related papers (2020-10-25T17:10:22Z)