On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning
- URL: http://arxiv.org/abs/2205.09121v2
- Date: Wed, 4 Oct 2023 14:44:35 GMT
- Title: On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning
- Authors: Mahsa Yousefi, Angeles Martinez
- Abstract summary: We study the behaviour of stochastic quasi-Newton training algorithms for deep neural networks.
We show that quasi-Newton optimizers are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While first-order methods are popular for solving optimization problems that
arise in large-scale deep learning problems, they come with some acute
deficiencies. To diminish such shortcomings, there has been recent interest in
applying second-order methods such as quasi-Newton based methods, which
construct Hessian approximations using only gradient information. The main
focus of our work is to study the behaviour of stochastic quasi-Newton
algorithms for training deep neural networks. We have analyzed the performance
of two well-known quasi-Newton updates, the limited memory
Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Symmetric Rank One (SR1). This
study fills a gap concerning the real performance of both updates and analyzes
whether more efficient training is obtained when using the more robust BFGS
update or the cheaper SR1 formula which allows for indefinite Hessian
approximations and thus can potentially help to better navigate the
pathological saddle points present in the non-convex loss functions found in
deep learning. We present and discuss the results of an extensive experimental
study which includes the effects of batch normalization and network
architecture, the limited memory parameter, the batch size, and the type of
sampling strategy. We show that stochastic quasi-Newton optimizers are
efficient and able to outperform in some instances the well-known first-order
Adam optimizer run with the optimal combination of its numerous
hyperparameters.
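To make the two updates concrete, the following is a minimal NumPy sketch of the textbook building blocks the abstract refers to: the L-BFGS two-loop recursion, which applies an implicit inverse-Hessian approximation to a (possibly stochastic) gradient, and the SR1 secant update, which, unlike BFGS, may yield an indefinite Hessian approximation. This is an illustrative sketch, not the authors' implementation; the function names, the skip rule, and the dense matrix used for SR1 are assumptions made here for brevity.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """L-BFGS two-loop recursion: returns -H_k @ grad, where H_k is the
    implicit inverse-Hessian approximation built from the stored pairs
    (s_i, y_i) = (x_{i+1} - x_i, g_{i+1} - g_i).  Assumes each y_i @ s_i > 0;
    in a stochastic setting this is usually enforced by damping or by
    skipping uninformative pairs."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # Backward pass over the memory (most recent pair first).
    for s, y, rho in zip(reversed(s_hist), reversed(y_hist), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q = q - alpha * y
    # Initial scaling gamma_k = s^T y / y^T y acts as the initial inverse Hessian.
    gamma = (s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1]) if s_hist else 1.0
    r = gamma * q
    # Forward pass over the memory (oldest pair first).
    for s, y, rho, alpha in zip(s_hist, y_hist, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return -r  # quasi-Newton search direction

def sr1_update(B, s, y, tol=1e-8):
    """SR1 update of a dense Hessian approximation B.  The update is skipped
    when the denominator is too small; B may become indefinite, which is the
    property that can help navigate saddle points."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) > tol * np.linalg.norm(r) * np.linalg.norm(s):
        B = B + np.outer(r, r) / denom
    return B
```

In an actual limited-memory SR1 solver the dense matrix above would be replaced by a compact representation and the step computed inside a trust region; those details, as well as the batch and sampling choices studied in the paper, are omitted from this sketch.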
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level Optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training.
We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields.
Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics based on minimizing the population loss that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems.
Conventional methods need to differentiate through the low-level optimization process with implicit differentiation.
First-order BO depends only on first-order information and requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Research of Damped Newton Stochastic Gradient Descent Method for Neural
Network Training [6.231508838034926]
First-order methods like stochastic gradient descent (SGD) are currently the most popular optimization methods for training deep neural networks (DNNs).
In this paper, we propose the Damped Newton Stochastic Gradient Descent (DN-SGD) and Stochastic Gradient Descent Damped Newton (SGD-DN) methods to train DNNs for regression problems with Mean Square Error (MSE) and classification problems with Cross-Entropy Loss (CEL).
Our methods apply the more accurate Newton-type update to only a small part of the parameters, which greatly reduces the computational cost and makes the learning process faster and more accurate than SGD.
arXiv Detail & Related papers (2021-03-31T02:07:18Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference (a minimal sketch of this trick appears after this list).
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
arXiv Detail & Related papers (2020-09-15T13:46:57Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed optimization algorithm for large-scale AUC maximization with a deep neural network.
Our method requires a much smaller number of communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Deep Neural Network Learning with Second-Order Optimizers -- a Practical
Study with a Stochastic Quasi-Gauss-Newton Method [0.0]
We introduce and study a second-order quasi-Gauss-Newton (SQGN) optimization method that combines ideas from quasi-Newton methods, Gauss-Newton methods, and variance reduction to address this problem.
We discuss the implementation of SQGN, and we compare its convergence and computational performance to selected first-order methods on benchmark problems.
arXiv Detail & Related papers (2020-04-06T23:41:41Z)
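As a side note on the complex-step entry above, the directional-derivative form of that trick is simple to state; the following is a minimal NumPy sketch of the general technique (not that paper's training algorithm), assuming the function is implemented with complex-safe operations so that the complex-step identity applies. The example function and values are hypothetical.

```python
import numpy as np

def complex_step_directional_derivative(f, x, v, h=1e-20):
    """Approximate d/dt f(x + t*v) at t = 0 via the complex step:
    Im(f(x + i*h*v)) / h.  No subtractive cancellation occurs,
    so h can be taken extremely small."""
    return np.imag(f(x + 1j * h * v)) / h

# Toy check (hypothetical example): f(x) = sum(sin(x)) has
# directional derivative cos(x) @ v.
x = np.array([0.3, -1.2, 2.0])
v = np.array([1.0, 0.5, -0.25])
f = lambda z: np.sum(np.sin(z))
print(complex_step_directional_derivative(f, x, v), np.cos(x) @ v)
```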