Questions for Flat-Minima Optimization of Modern Neural Networks
- URL: http://arxiv.org/abs/2202.00661v2
- Date: Wed, 2 Feb 2022 18:52:25 GMT
- Title: Questions for Flat-Minima Optimization of Modern Neural Networks
- Authors: Jean Kaddour, Linqing Liu, Ricardo Silva, Matt J. Kusner
- Abstract summary: Two methods for finding flat minima stand out: 1. Averaging methods (i.e., Stochastic Weight Averaging, SWA) and 2. Minimax methods (i.e., Sharpness Aware Minimization, SAM)
We investigate the loss surfaces from a systematic benchmarking of these approaches across computer vision, natural language processing, and graph learning tasks.
- Score: 28.12506392321345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For training neural networks, flat-minima optimizers that seek to find
parameters in neighborhoods having uniformly low loss (flat minima) have been
shown to improve upon stochastic and adaptive gradient-based methods. Two
methods for finding flat minima stand out: 1. Averaging methods (i.e.,
Stochastic Weight Averaging, SWA), and 2. Minimax methods (i.e., Sharpness
Aware Minimization, SAM). However, despite similar motivations, there has been
limited investigation into their properties and no comprehensive comparison
between them. In this work, we investigate the loss surfaces from a systematic
benchmarking of these approaches across computer vision, natural language
processing, and graph learning tasks. The results lead to a simple hypothesis:
since both approaches find different flat solutions, combining them should
improve generalization even further. We verify this improves over either
flat-minima approach in 39 out of 42 cases. When it does not, we investigate
potential reasons. We hope our results across image, graph, and text data will
help researchers to improve deep learning optimizers, and practitioners to
pinpoint the optimizer for the problem at hand.
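To make the two families concrete: SWA-style averaging keeps a running mean of the weights visited late in training, while SAM perturbs the weights toward the worst-case neighbor within a small ball before taking the descent step. The PyTorch-style sketch below is a rough illustration of one step of each, not the authors' exact training recipe; sam_step, update_swa, rho, and swa_params are illustrative names.

```python
import torch

def sam_step(model, loss_fn, data, target, base_opt, rho=0.05):
    """One Sharpness-Aware Minimization (SAM) update: ascend to the worst-case
    neighbor within an L2 ball of radius rho, take the gradient there, then
    restore the weights and let the base optimizer apply that gradient."""
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()          # gradient at current weights
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(2) for p in model.parameters() if p.grad is not None]), 2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)   # ascent direction, scaled to the ball
            p.add_(e)
            perturbations.append(e)
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()          # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)                            # restore the original weights
    base_opt.step()                                  # descend with the sharpness-aware gradient

@torch.no_grad()
def update_swa(swa_params, model, n_averaged):
    """Stochastic Weight Averaging (SWA): equal-weight running mean of the
    weights visited so far; swa_params starts as a copy of the parameters."""
    for avg, p in zip(swa_params, model.parameters()):
        avg += (p.detach() - avg) / (n_averaged + 1)
    return n_averaged + 1
```

Under the paper's hypothesis, the two can be combined by training with sam_step and averaging the visited weights with update_swa during a final phase; in practice the BatchNorm statistics of the averaged weights would then need to be recomputed (e.g., with torch.optim.swa_utils.update_bn).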
Related papers
- FLOPS: Forward Learning with OPtimal Sampling [1.694989793927645]
Gradient-computation methods that rely only on forward passes, also referred to as queries, have recently gained attention.
Conventional forward learning consumes an enormous number of queries per data point to obtain accurate gradient estimates through Monte Carlo sampling.
We propose to allocate the optimal number of queries to each data point in a batch during training, striking a balance between estimation accuracy and computational efficiency.
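For context, the snippet below sketches the kind of Monte Carlo forward-only gradient estimator this line of work builds on, not the FLOPS allocation scheme itself; f, theta, num_queries, and mu are illustrative names, and the query budget is exactly the knob such allocation methods tune per data point.

```python
import numpy as np

def forward_only_grad(f, theta, num_queries=16, mu=1e-3, seed=0):
    """Estimate the gradient of a scalar loss f at theta using only forward
    evaluations (queries): average directional finite differences taken along
    random probe directions."""
    rng = np.random.default_rng(seed)
    base = f(theta)
    grad_est = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)        # random probe direction
        slope = (f(theta + mu * u) - base) / mu     # directional finite difference
        grad_est += slope * u
    return grad_est / num_queries                   # more queries -> lower variance
```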
arXiv Detail & Related papers (2024-10-08T12:16:12Z)
- Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find the solutions reachable by our training procedure, whose components, such as the optimizer and regularizers, limit this flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z)
- How to escape sharp minima with random perturbations [48.095392390925745]
We study the notion of flat minima and the complexity of finding them.
For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently.
For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization.
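Since the summary does not spell out the algorithms, the sketch below shows only the generic perturb-then-descend primitive that random-perturbation escape methods build on: evaluate the gradient at a randomly perturbed point and step from there (a randomized cousin of SAM's worst-case perturbation). All names and constants are illustrative, not the paper's procedure.

```python
import numpy as np

def perturb_then_descend(theta, grad_fn, step_size=0.1, radius=0.01, seed=0):
    """One generic escape step: near a sharp minimum, the gradient at a random
    nearby point tends to be large and pushes the iterate out of the narrow
    basin; near a flat minimum it stays small and the iterate barely moves."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(theta.shape)
    u *= radius / (np.linalg.norm(u) + 1e-12)       # random point on a small sphere
    return theta - step_size * grad_fn(theta + u)
```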
arXiv Detail & Related papers (2023-05-25T02:12:33Z)
- Adaptive Federated Minimax Optimization with Lower Complexities [82.51223883622552]
We propose an efficient adaptive minimax optimization algorithm (i.e., AdaFGDA) to solve federated minimax problems.
It builds on momentum-based variance-reduced and local-SGD techniques, and it can flexibly incorporate various adaptive learning rates.
arXiv Detail & Related papers (2022-11-14T12:32:18Z)
- Adaptive Self-supervision Algorithms for Physics-informed Neural Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
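As a rough illustration of residual-driven allocation (a hypothetical variant, not necessarily the paper's exact scheme), the sketch below resamples collocation points with probability proportional to the current PDE residual magnitude, so the regions where the PINN is worst receive more training points; residual_fn and the budget parameters are illustrative.

```python
import numpy as np

def resample_collocation(residual_fn, low, high, n_candidates=4096, n_keep=512, seed=0):
    """Draw candidate collocation points uniformly over the box [low, high], then
    keep a subset with probability proportional to |PDE residual|, concentrating
    training points where the model currently makes the largest errors."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(low, float), np.asarray(high, float)
    candidates = rng.uniform(low, high, size=(n_candidates, low.size))
    weights = np.abs(residual_fn(candidates))       # per-point residual magnitude
    probs = weights / weights.sum()
    idx = rng.choice(n_candidates, size=n_keep, replace=False, p=probs)
    return candidates[idx]
```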
arXiv Detail & Related papers (2022-07-08T18:17:06Z)
- AlterSGD: Finding Flat Minima for Continual Learning by Alternative Training [11.521519687645428]
We propose a simple yet effective optimization method, called AlterSGD, to search for flat minima in the loss landscape.
We prove that such a strategy can encourage the optimization to converge to flat minima.
We verify AlterSGD on a continual learning benchmark for semantic segmentation, and the empirical results show that it can significantly mitigate forgetting.
arXiv Detail & Related papers (2021-07-13T01:43:51Z)
- Unveiling the structure of wide flat minima in neural networks [0.46664938579243564]
The success of deep learning has revealed the application potential of neural networks across the sciences.
arXiv Detail & Related papers (2021-07-02T16:04:57Z)
- GOALS: Gradient-Only Approximations for Line Searches Towards Robust and Consistent Training of Deep Neural Networks [0.0]
Mini-batch sub-sampling (MBSS) is favored in deep neural network training to reduce the computational cost.
We propose a gradient-only approximation line search (GOALS) with strong convergence characteristics and a defined optimality criterion.
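As a hedged illustration of the gradient-only idea (a generic sketch, not necessarily the paper's GOALS procedure), the code below picks a step size by bisecting on the sign of the mini-batch directional derivative along the search direction, never consulting noisy loss values; grad_fn, direction, and alpha_max are illustrative names, and the interval is assumed to bracket a sign change.

```python
import numpy as np

def gradient_only_line_search(grad_fn, theta, direction, alpha_max=1.0, iters=20):
    """Bisect on the sign of the directional derivative g(theta + a*d)^T d:
    negative means still descending along d (step further); non-negative means
    the step has passed the sign change (step back). Only gradients are used."""
    lo, hi = 0.0, alpha_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        slope = grad_fn(theta + mid * direction) @ direction
        if slope < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```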
arXiv Detail & Related papers (2021-05-23T11:21:01Z)
- Gradient Free Minimax Optimization: Variance Reduction and Faster Convergence [120.9336529957224]
In this paper, we consider gradient-free minimax optimization in the nonconvex-strongly-concave setting, where only function values, not gradients, are available.
We show that a novel zeroth-order variance-reduced descent-ascent algorithm achieves the best known query complexity.
arXiv Detail & Related papers (2020-06-16T17:55:46Z)
- A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima [91.11332770406007]
We show that Stochastic Gradient Descent (SGD) favors flat minima exponentially more than sharp minima.
We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima.
arXiv Detail & Related papers (2020-02-10T02:04:49Z)