Empirically explaining SGD from a line search perspective
- URL: http://arxiv.org/abs/2103.17132v1
- Date: Wed, 31 Mar 2021 14:54:22 GMT
- Title: Empirically explaining SGD from a line search perspective
- Authors: Maximus Mutschler and Andreas Zell
- Abstract summary: We show that the full-batch loss along lines in the update step direction is highly parabolic.
We also show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss.
- Score: 21.35522589789314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimization in Deep Learning is mainly guided by vague intuitions and strong
assumptions, with a limited understanding of how and why these work in practice.
To shed more light on this, our work provides a deeper understanding of how
SGD behaves by empirically analyzing the trajectory taken by SGD from a line
search perspective. Specifically, a costly quantitative analysis of the
full-batch loss along SGD trajectories from commonly used models trained on a
subset of CIFAR-10 is performed. Our core results include that the full-batch
loss along lines in the update step direction is highly parabolic. Furthermore,
we show that there exists a learning rate with which SGD always performs almost
exact line searches on the full-batch loss. Finally, we provide a different
perspective on why increasing the batch size has almost the same effect as
decreasing the learning rate by the same factor.
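The two central measurements in the abstract, that the loss is nearly parabolic along the update step direction and that some learning rate corresponds to an almost exact line search, can be illustrated with a short probe. The following is a minimal sketch and not the authors' code: the model, the loss function, and the single fixed batch standing in for the full-batch loss are placeholder assumptions.

```python
# Minimal sketch (not the authors' code): probe the loss along the SGD update
# direction, check how well a parabola fits it, and read off the step size an
# exact line search would choose. Model, loss function, and the fixed batch
# standing in for the "full batch" are placeholder assumptions.
import numpy as np
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
loss_fn = torch.nn.MSELoss()
x, y = torch.randn(512, 10), torch.randn(512, 1)  # stand-in for the full batch

params = [p for p in model.parameters() if p.requires_grad]

# The SGD update direction at the current point: the negative gradient.
grads = torch.autograd.grad(loss_fn(model(x), y), params)
direction = [-g for g in grads]

def loss_at(step):
    """Loss after tentatively moving `step` along the update direction (then undone)."""
    with torch.no_grad():
        for p, d in zip(params, direction):
            p.add_(step * d)
        value = loss_fn(model(x), y).item()
        for p, d in zip(params, direction):
            p.sub_(step * d)
    return value

# Sample the 1-D loss l(s) along the line and fit a parabola a*s^2 + b*s + c.
steps = np.linspace(0.0, 0.5, 11)
losses = np.array([loss_at(float(s)) for s in steps])
a, b, c = np.polyfit(steps, losses, deg=2)

print("max parabola-fit residual:", np.abs(np.polyval([a, b, c], steps) - losses).max())
# Because an SGD step is -lr * gradient, the step size s equals the learning
# rate, so the parabola's vertex is the learning rate of an exact line search.
if a > 0:
    print("exact line-search learning rate:", -b / (2 * a))
```

On real networks the full-batch measurement is the expensive part; the paper performs this quantitative analysis along SGD trajectories of models trained on a subset of CIFAR-10.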
Related papers
- Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity [84.12126298229866]
We show that zero-shot generalization during instruction tuning happens very early.
We also show that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization.
For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level.
arXiv Detail & Related papers (2024-06-17T16:40:21Z)
- When and Why Momentum Accelerates SGD: An Empirical Study [76.2666927020119]
This study examines the performance of stochastic gradient descent (SGD) with momentum (SGDM).
We find that momentum acceleration is closely related to abrupt sharpening, which describes a sudden jump of the directional Hessian along the update direction.
Momentum improves performance by preventing or deferring the occurrence of abrupt sharpening.
arXiv Detail & Related papers (2023-06-15T09:54:21Z)
- Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions [26.782342518986503]
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems.
We show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD.
arXiv Detail & Related papers (2022-06-15T02:32:26Z)
- Risk Bounds of Multi-Pass SGD for Least Squares in the Interpolation Regime [127.21287240963859]
Stochastic gradient descent (SGD) has achieved great success due to its superior performance in both optimization and generalization.
This paper aims to sharply characterize the generalization of multi-pass SGD.
We show that although SGD needs more iterations than GD to achieve the same level of excess risk, it requires fewer gradient evaluations.
arXiv Detail & Related papers (2022-03-07T06:34:53Z)
- Using a one dimensional parabolic model of the full-batch loss to estimate learning rates during training [21.35522589789314]
This work introduces a line-search method that approximates the full-batch loss with a parabola estimated over several mini-batches.
In the experiments conducted, our approach mostly outperforms SGD tuned with a piecewise constant learning rate schedule (a hedged sketch of this parabolic estimate appears after the list).
arXiv Detail & Related papers (2021-08-31T14:36:23Z)
- SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs [30.41773138781369]
We present a multi-epoch variant of Stochastic Gradient Descent (SGD) commonly used in practice.
We prove that this is at least as good as single pass SGD in the worst case.
For certain stochastic convex optimization (SCO) problems, taking multiple passes over the dataset can significantly outperform single pass SGD.
arXiv Detail & Related papers (2021-07-11T15:50:01Z)
- Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
arXiv Detail & Related papers (2021-03-03T12:08:23Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Accelerated Convergence for Counterfactual Learning to Rank [65.63997193915257]
We show that the convergence rate of SGD approaches with inverse propensity scoring (IPS)-weighted gradients suffers from the large variance introduced by the IPS weights.
We propose a novel learning algorithm, called CounterSample, that has provably better convergence than standard IPS-weighted gradient descent methods.
We prove that CounterSample converges faster and complement our theoretical findings with empirical results.
arXiv Detail & Related papers (2020-05-21T12:53:36Z)
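Related to the parabolic-model entry above (the follow-up work by the same authors), the sketch below shows one hedged way a learning rate could be estimated from a parabola fitted to a few mini-batch loss measurements along the update direction. It is an illustration under assumptions, not the published algorithm: `model`, `loss_fn`, and `batches` are placeholders, and the probe step sizes and fallback are arbitrary choices.

```python
# Hedged sketch (not the published method): fit a parabola to mini-batch loss
# measurements along the current update direction and use its vertex as the
# learning rate for this step. `model`, `loss_fn`, and `batches` (a small list
# of (inputs, targets) mini-batches) are placeholder assumptions.
import numpy as np
import torch

def parabolic_learning_rate(model, loss_fn, batches, probe_steps=(0.0, 0.05, 0.1)):
    params = [p for p in model.parameters() if p.requires_grad]

    # Update direction from the first mini-batch: the usual SGD direction.
    x0, y0 = batches[0]
    grads = torch.autograd.grad(loss_fn(model(x0), y0), params)
    direction = [-g for g in grads]

    def mean_loss_at(step):
        """Average mini-batch loss after tentatively moving `step` along the direction."""
        with torch.no_grad():
            for p, d in zip(params, direction):
                p.add_(step * d)
            value = sum(loss_fn(model(xb), yb).item() for xb, yb in batches) / len(batches)
            for p, d in zip(params, direction):
                p.sub_(step * d)
        return value

    # Three probes determine the parabola exactly; more probes would smooth noise.
    losses = [mean_loss_at(s) for s in probe_steps]
    a, b, _ = np.polyfit(probe_steps, losses, deg=2)
    # Vertex of the fitted parabola; fall back to a small probe step if the fit
    # is not convex and the vertex would not be a minimum.
    return float(-b / (2 * a)) if a > 0 else float(min(s for s in probe_steps if s > 0))
```

A training loop would then take a step of this size along the same direction at each iteration; the published method differs in how it measures and safeguards the parabola, so treat this purely as a schematic.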
This list is automatically generated from the titles and abstracts of the papers on this site.