A block coordinate descent optimizer for classification problems
exploiting convexity
- URL: http://arxiv.org/abs/2006.10123v1
- Date: Wed, 17 Jun 2020 19:49:06 GMT
- Title: A block coordinate descent optimizer for classification problems
exploiting convexity
- Authors: Ravi G. Patel, Nathaniel A. Trask, Mamikon A. Gulian, Eric C. Cyr
- Abstract summary: We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits convexity of the cross-entropy loss in the weights of the linear layer.
By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Second-order optimizers hold intriguing potential for deep learning, but
suffer from increased cost and sensitivity to the non-convexity of the loss
surface as compared to gradient-based approaches. We introduce a coordinate
descent method to train deep neural networks for classification tasks that
exploits global convexity of the cross-entropy loss in the weights of the
linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent
with the interpretation of hidden layers as providing an adaptive basis and the
linear layer as providing an optimal fit of the basis to data. By alternating
between a second-order method to find globally optimal parameters for the
linear layer and gradient descent to train the hidden layers, we ensure an
optimal fit of the adaptive basis to data throughout training. The size of the
Hessian in the second-order step scales only with the number of weights in the
linear layer, not with the depth and width of the hidden layers; furthermore, the
approach is applicable to arbitrary hidden layer architectures. Previous work
applying this adaptive basis perspective to regression problems demonstrated
significant improvements in accuracy at reduced training cost, and this work
can be viewed as an extension of this approach to classification problems. We
first prove that the resulting Hessian matrix is symmetric positive semi-definite, and
that the Newton step realizes a global minimizer. By studying classification of
manufactured two-dimensional point cloud data, we demonstrate both an
improvement in validation error and a striking qualitative difference in the
basis functions encoded in the hidden layer when trained using NGD. Application
to image classification benchmarks for both dense and convolutional
architectures reveals improved training accuracy, suggesting possible gains of
second-order methods over gradient descent.
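The alternating scheme lends itself to a compact illustration. Below is a minimal sketch of the idea in PyTorch, assuming a model split into a hidden feature extractor and a final bias-free linear layer; the helper newton_step_linear, the damping term, the toy data, and all hyperparameters are illustrative assumptions rather than the authors' implementation, and a single damped Newton step stands in for the exact second-order solve described in the paper.
```python
# Minimal sketch of the hybrid Newton/Gradient Descent (NGD) idea described above.
# The cross-entropy loss is convex in the final linear layer's weights, so that
# layer is updated with a (damped) Newton step on frozen hidden features, while
# the hidden layers are updated by ordinary gradient descent.  Names, the toy
# data, and all hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_in, d_hid, n_cls = 2, 32, 3
hidden = torch.nn.Sequential(torch.nn.Linear(d_in, d_hid), torch.nn.Tanh())
linear = torch.nn.Linear(d_hid, n_cls, bias=False)        # weights W, shape (C, d)
opt_hidden = torch.optim.SGD(hidden.parameters(), lr=1e-2)


def newton_step_linear(phi, y, W, damping=1e-6):
    """One Newton step for softmax cross-entropy in W, with features phi fixed.

    The Hessian is (C*d) x (C*d): its size depends only on the linear layer,
    not on the depth or width of the hidden layers.
    """
    n, d = phi.shape
    C = W.shape[0]
    p = torch.softmax(phi @ W.t(), dim=1)                  # (n, C) class probabilities
    y1h = F.one_hot(y, C).float()
    g = ((p - y1h).t() @ phi / n).reshape(-1)              # grad of mean loss wrt vec(W)
    # Exact Hessian: (1/n) sum_i (diag(p_i) - p_i p_i^T) kron (phi_i phi_i^T); each term is PSD.
    H = torch.zeros(C * d, C * d)
    for i in range(n):
        S = torch.diag(p[i]) - torch.outer(p[i], p[i])
        H += torch.kron(S, torch.outer(phi[i], phi[i])) / n
    H += damping * torch.eye(C * d)                        # small damping keeps the solve well-posed
    return W - torch.linalg.solve(H, g).reshape(C, d)


# Toy 2D point-cloud data: three Gaussian blobs, one per class.
centers = torch.tensor([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = torch.randn(300, d_in) + torch.repeat_interleave(centers, 100, dim=0)
y = torch.repeat_interleave(torch.arange(n_cls), 100)

for epoch in range(20):
    # (1) Convex subproblem: second-order update of the linear layer on frozen features.
    with torch.no_grad():
        phi = hidden(X)
        linear.weight.copy_(newton_step_linear(phi, y, linear.weight))
    # (2) Non-convex subproblem: gradient step on the hidden layers with W frozen.
    opt_hidden.zero_grad()
    loss = F.cross_entropy(linear(hidden(X)), y)
    loss.backward()
    opt_hidden.step()

print(f"final training loss: {loss.item():.3f}")
```
On this toy data the loss drops within a few alternations; the paper's actual solver (an exact convex solve rather than a single damped step), architectures, and benchmarks are more involved.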
Related papers
- Gradient-Variation Online Learning under Generalized Smoothness [56.38427425920781]
Gradient-variation online learning aims to achieve regret guarantees that scale with variations in the gradients of online functions.
Recent efforts in neural network optimization suggest a generalized smoothness condition, allowing smoothness to correlate with gradient norms.
We provide applications to fast-rate convergence in games and to extended adversarial optimization.
arXiv Detail & Related papers (2024-08-17T02:22:08Z) - Adaptive Federated Learning Over the Air [108.62635460744109]
We propose a federated version of adaptive gradient methods, particularly AdaGrad and Adam, within the framework of over-the-air model training.
Our analysis shows that the AdaGrad-based training algorithm converges to a stationary point at the rate of $\mathcal{O}\!\left(\ln(T) / T^{1 - \frac{1}{\alpha}}\right)$.
arXiv Detail & Related papers (2024-03-11T09:10:37Z) - Neural Gradient Learning and Optimization for Oriented Point Normal
Estimation [53.611206368815125]
We propose a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation.
We learn an angular distance field based on local plane geometry to refine the coarse gradient vectors.
Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization ability for local feature description.
arXiv Detail & Related papers (2023-09-17T08:35:11Z) - Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for
Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z) - Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data [63.34506218832164]
In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations.
For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that, asymptotically, gradient flow produces a neural network with rank at most two.
For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training.
arXiv Detail & Related papers (2022-10-13T15:09:54Z) - Cogradient Descent for Dependable Learning [64.02052988844301]
We propose a dependable learning framework based on the Cogradient Descent (CoGD) algorithm to address bilinear optimization problems.
CoGD is introduced to solve bilinear problems when one variable is subject to a sparsity constraint.
It can also be used to decompose the association of features and weights, which further generalizes our method to better train convolutional neural networks (CNNs).
arXiv Detail & Related papers (2021-06-20T04:28:20Z) - Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem.
We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent.
Our algorithm is applied to solve problems in which one variable is under a sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z) - The duality structure gradient descent algorithm: analysis and applications to neural networks [0.0]
We propose an algorithm named duality structure gradient descent (DSGD) that is amenable to non-asymptotic performance analysis.
We empirically demonstrate the behavior of DSGD in several neural network training scenarios.
arXiv Detail & Related papers (2017-08-01T21:24:38Z)