On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning
- URL: http://arxiv.org/abs/2205.09121v2
- Date: Wed, 4 Oct 2023 14:44:35 GMT
- Title: On the efficiency of Stochastic Quasi-Newton Methods for Deep Learning
- Authors: Mahsa Yousefi, Angeles Martinez
- Abstract summary: We study the behaviour of stochastic quasi-Newton training algorithms for deep neural networks.
We show that quasi-Newton optimizers are efficient and, in some instances, able to outperform the well-known first-order Adam optimizer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While first-order methods are popular for solving optimization problems that
arise in large-scale deep learning problems, they come with some acute
deficiencies. To diminish such shortcomings, there has been recent interest in
applying second-order methods such as quasi-Newton based methods, which
construct Hessian approximations using only gradient information. The main
focus of our work is to study the behaviour of stochastic quasi-Newton
algorithms for training deep neural networks. We have analyzed the performance
of two well-known quasi-Newton updates, the limited memory
Broyden-Fletcher-Goldfarb-Shanno (BFGS) and the Symmetric Rank One (SR1). This
study fills a gap concerning the real performance of both updates and analyzes
whether more efficient training is obtained when using the more robust BFGS
update or the cheaper SR1 formula which allows for indefinite Hessian
approximations and thus can potentially help to better navigate the
pathological saddle points present in the non-convex loss functions found in
deep learning. We present and discuss the results of an extensive experimental
study which includes the effects of batch normalization and network
architecture, the limited memory parameter, the batch size, and the type of
sampling strategy. We show that stochastic quasi-Newton optimizers are
efficient and able to outperform in some instances the well-known first-order
Adam optimizer run with the optimal combination of its numerous
hyperparameters.
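To make the two updates concrete, the following is a minimal NumPy sketch of the textbook building blocks the abstract refers to: the L-BFGS two-loop recursion, which applies an implicit inverse-Hessian approximation to a (possibly stochastic) gradient, and the SR1 secant update, which, unlike BFGS, may yield an indefinite Hessian approximation. This is an illustrative sketch, not the authors' implementation; the function names, the skip rule, and the dense matrix used for SR1 are assumptions made here for brevity.

```python
import numpy as np

def lbfgs_direction(grad, s_hist, y_hist):
    """L-BFGS two-loop recursion: returns -H_k @ grad, where H_k is the
    implicit inverse-Hessian approximation built from the stored pairs
    (s_i, y_i) = (x_{i+1} - x_i, g_{i+1} - g_i).  Assumes each y_i @ s_i > 0;
    in a stochastic setting this is usually enforced by damping or by
    skipping uninformative pairs."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_hist, y_hist)]
    alphas = []
    # Backward pass over the memory (most recent pair first).
    for s, y, rho in zip(reversed(s_hist), reversed(y_hist), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q = q - alpha * y
    # Initial scaling gamma_k = s^T y / y^T y acts as the initial inverse Hessian.
    gamma = (s_hist[-1] @ y_hist[-1]) / (y_hist[-1] @ y_hist[-1]) if s_hist else 1.0
    r = gamma * q
    # Forward pass over the memory (oldest pair first).
    for s, y, rho, alpha in zip(s_hist, y_hist, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return -r  # quasi-Newton search direction

def sr1_update(B, s, y, tol=1e-8):
    """SR1 update of a dense Hessian approximation B.  The update is skipped
    when the denominator is too small; B may become indefinite, which is the
    property that can help navigate saddle points."""
    r = y - B @ s
    denom = r @ s
    if abs(denom) > tol * np.linalg.norm(r) * np.linalg.norm(s):
        B = B + np.outer(r, r) / denom
    return B
```

In an actual limited-memory SR1 solver the dense matrix above would be replaced by a compact representation and the step computed inside a trust region; those details, as well as the batch and sampling choices studied in the paper, are omitted from this sketch.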
Related papers
- A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta learning problem that is ubiquitous in modern deep learning.
These problems are often formalized as Bi-Level Optimizations (BLO).
We introduce a novel perspective by turning a given BLO problem into a stochastic optimization, where the inner loss function becomes a smooth probability distribution and the outer loss becomes an expected loss over the inner distribution.
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Robust Learning with Progressive Data Expansion Against Spurious
Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - Learning Large-scale Neural Fields via Context Pruned Meta-Learning [60.93679437452872]
We introduce an efficient optimization-based meta-learning technique for large-scale neural field training.
We show how gradient re-scaling at meta-test time allows the learning of extremely high-quality neural fields.
Our framework is model-agnostic, intuitive, straightforward to implement, and shows significant reconstruction improvements for a wide range of signals.
arXiv Detail & Related papers (2023-02-01T17:32:16Z) - Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics based on minimizing the population loss that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z) - BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach [46.457298683984924]
Bilevel optimization (BO) is useful for solving a variety of important machine learning problems.
Conventional methods need to differentiate through the low-level optimization process with implicit differentiation.
First-order BO depends only on first-order information and requires no implicit differentiation.
arXiv Detail & Related papers (2022-09-19T01:51:12Z) - Simple Stochastic and Online Gradient Descent Algorithms for Pairwise
Learning [65.54757265434465]
Pairwise learning refers to learning tasks where the loss function depends on a pair of instances.
Online gradient descent (OGD) is a popular approach to handle streaming data in pairwise learning.
In this paper, we propose simple stochastic and online gradient descent methods for pairwise learning.
arXiv Detail & Related papers (2021-11-23T18:10:48Z) - SHINE: SHaring the INverse Estimate from the forward pass for bi-level
optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z) - Research of Damped Newton Stochastic Gradient Descent Method for Neural
Network Training [6.231508838034926]
First-order methods like stochastic gradient descent (SGD) are currently the most popular optimization methods for training deep neural networks (DNNs).
In this paper, we propose the Damped Newton Stochastic Gradient Descent (DN-SGD) and Stochastic Gradient Descent Damped Newton (SGD-DN) methods to train DNNs for regression problems with Mean Square Error (MSE) and classification problems with Cross-Entropy Loss (CEL).
Our methods apply the more accurate Newton-type update to only a small part of the parameters, which greatly reduces the computational cost and makes the learning process faster and more accurate than SGD.
arXiv Detail & Related papers (2021-03-31T02:07:18Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference (a minimal sketch of this trick appears after this list).
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
arXiv Detail & Related papers (2020-09-15T13:46:57Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study a distributed optimization algorithm for large-scale AUC maximization with a deep neural network.
Our method requires a much smaller number of communication rounds in theory.
Our experiments on several datasets demonstrate the effectiveness of our method and also confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z) - Deep Neural Network Learning with Second-Order Optimizers -- a Practical
Study with a Stochastic Quasi-Gauss-Newton Method [0.0]
We introduce and study a second-order quasi-Gauss-Newton (SQGN) optimization method that combines ideas from quasi-Newton methods, Gauss-Newton methods, and variance reduction to address this problem.
We discuss the implementation of SQGN, and we compare its convergence and computational performance to selected first-order methods on benchmark problems.
arXiv Detail & Related papers (2020-04-06T23:41:41Z)
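As a side note on the complex-step entry above, the directional-derivative form of that trick is simple to state; the following is a minimal NumPy sketch of the general technique (not that paper's training algorithm), assuming the function is implemented with complex-safe operations so that the complex-step identity applies. The example function and values are hypothetical.

```python
import numpy as np

def complex_step_directional_derivative(f, x, v, h=1e-20):
    """Approximate d/dt f(x + t*v) at t = 0 via the complex step:
    Im(f(x + i*h*v)) / h.  No subtractive cancellation occurs,
    so h can be taken extremely small."""
    return np.imag(f(x + 1j * h * v)) / h

# Toy check (hypothetical example): f(x) = sum(sin(x)) has
# directional derivative cos(x) @ v.
x = np.array([0.3, -1.2, 2.0])
v = np.array([1.0, 0.5, -0.25])
f = lambda z: np.sum(np.sin(z))
print(complex_step_directional_derivative(f, x, v), np.cos(x) @ v)
```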