Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the
Hessian
- URL: http://arxiv.org/abs/2011.06505v1
- Date: Thu, 12 Nov 2020 17:15:09 GMT
- Title: Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the
Hessian
- Authors: Jack Parker-Holder, Luke Metz, Cinjon Resnick, Hengyuan Hu, Adam
Lerer, Alistair Letcher, Alex Peysakhovich, Aldo Pacchiano, Jakob Foerster
- Abstract summary: Stochastic Gradient Descent (SGD) is a key component of the success of deep neural networks (DNNs).
In this paper, we present a different approach by following the eigenvectors of the Hessian, which we call "ridges".
We show both theoretically and experimentally that our method, called Ridge Rider (RR), offers a promising direction for a variety of challenging problems.
- Score: 48.61341260604871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over the last decade, a single algorithm has changed many facets of our lives
- Stochastic Gradient Descent (SGD). In the era of ever decreasing loss
functions, SGD and its various offspring have become the go-to optimization
tool in machine learning and are a key component of the success of deep neural
networks (DNNs). While SGD is guaranteed to converge to a local optimum (under
loose assumptions), in some cases it may matter which local optimum is found,
and this is often context-dependent. Examples frequently arise in machine
learning, from shape-versus-texture-features to ensemble methods and zero-shot
coordination. In these settings, there are desired solutions which SGD on
'standard' loss functions will not find, since it instead converges to the
'easy' solutions. In this paper, we present a different approach. Rather than
following the gradient, which corresponds to a locally greedy direction, we
instead follow the eigenvectors of the Hessian, which we call "ridges". By
iteratively following and branching amongst the ridges, we effectively span the
loss surface to find qualitatively different solutions. We show both
theoretically and experimentally that our method, called Ridge Rider (RR),
offers a promising direction for a variety of challenging problems.
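As a rough illustration of the branching idea, here is a minimal sketch on a toy two-dimensional loss with a saddle at the origin. The loss, step sizes, and single branching depth are assumptions made for the example; this is not the paper's full RR algorithm, which also tracks each eigenvector along its ridge.

```python
import numpy as np

def loss(x):
    # Toy 2D loss with a saddle at the origin and two distinct minima (illustrative choice).
    return (x[0]**2 - 1.0)**2 + 0.5 * x[1]**2

def grad(f, x, eps=1e-5):
    # Central-difference gradient (stand-in for autodiff).
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

def hessian(f, x, eps=1e-4):
    # Finite-difference Hessian, symmetrised.
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = eps
        H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2.0 * eps)
    return 0.5 * (H + H.T)

def ridge_rider_sketch(x0, branch_step=0.2, lr=0.05, steps=200):
    # At the starting saddle, branch along +/- each Hessian eigenvector ("ridge"),
    # then descend; different branches can land in qualitatively different optima.
    H = hessian(loss, x0)
    _, eigvecs = np.linalg.eigh(H)
    solutions = []
    for i in range(eigvecs.shape[1]):
        for sign in (+1.0, -1.0):
            x = x0 + sign * branch_step * eigvecs[:, i]
            for _ in range(steps):
                x = x - lr * grad(loss, x)
            solutions.append(np.round(x, 3))
    return solutions

# The branches along the negative-curvature ridge reach the two minima at x0 = +1 and x0 = -1.
print(ridge_rider_sketch(np.zeros(2)))
```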
Related papers
- Universal Online Learning with Gradient Variations: A Multi-layer Online Ensemble Approach [57.92727189589498]
We propose an online convex optimization approach with two different levels of adaptivity.
We obtain $\mathcal{O}(\log V_T)$, $\mathcal{O}(d \log V_T)$ and $\hat{\mathcal{O}}(\sqrt{V_T})$ regret bounds for strongly convex, exp-concave and convex loss functions.
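As a rough sketch of what a two-layer online ensemble can look like: base experts running online gradient descent over a made-up grid of step sizes, combined by a Hedge-style meta-learner. This is an illustrative stand-in, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 200, 3
etas = [0.01, 0.1, 1.0]                    # step-size grid for the base learners (illustrative)
experts = [np.zeros(d) for _ in etas]      # each base learner runs online gradient descent
weights = np.ones(len(etas)) / len(etas)   # Hedge weights maintained by the meta-learner
meta_lr, meta_loss = 1.0, 0.0

for t in range(T):
    target = rng.normal(size=d)            # toy stream of quadratic losses f_t(x) = 0.5 * ||x - target||^2
    prediction = sum(w * x for w, x in zip(weights, experts))   # meta-learner plays the weighted combination
    meta_loss += 0.5 * np.sum((prediction - target) ** 2)
    # Each expert suffers its own loss and takes an online gradient step with its own step size.
    losses = np.array([0.5 * np.sum((x - target) ** 2) for x in experts])
    experts = [x - eta * (x - target) for x, eta in zip(experts, etas)]
    # Hedge update: exponentially down-weight experts that incurred larger loss.
    weights = weights * np.exp(-meta_lr * losses)
    weights = weights / weights.sum()

print("cumulative meta loss:", round(meta_loss, 2))
print("final weights over step sizes:", dict(zip(etas, np.round(weights, 3))))
```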
arXiv Detail & Related papers (2023-07-17T09:55:35Z)
- Efficient Quality-Diversity Optimization through Diverse Quality Species [3.428706362109921]
We show that a diverse population of solutions can be found without the limitation of needing an archive or defining the range of behaviors in advance.
We propose Diverse Quality Species (DQS) as an alternative to archive-based Quality-Diversity (QD) algorithms.
arXiv Detail & Related papers (2023-04-14T23:15:51Z)
- Random initialisations performing above chance and how to find them [22.812660025650253]
Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks.
Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks.
We find that the two networks already live in the same loss valley at the time of initialisation, and that averaging their random but suitably permuted initialisations performs significantly above chance.
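A minimal sketch of the unit-matching step for a single hidden layer, assuming a Hungarian-algorithm match on incoming weights (an illustrative choice; the paper's matching objective may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 2

# Two one-hidden-layer MLPs; network B is network A with its hidden units shuffled
# (a toy stand-in for two independently trained networks).
W1_a, W2_a = rng.normal(size=(d_hidden, d_in)), rng.normal(size=(d_out, d_hidden))
perm_true = rng.permutation(d_hidden)
W1_b, W2_b = W1_a[perm_true], W2_a[:, perm_true]

# Match hidden units of B to those of A by maximising the similarity of their incoming weights.
cost = -W1_a @ W1_b.T                                  # cost[i, j] = -<unit i of A, unit j of B>
_, col = linear_sum_assignment(cost)                   # Hungarian algorithm
W1_b_aligned, W2_b_aligned = W1_b[col], W2_b[:, col]   # apply the permutation to both layers

# After permuting, averaging the two networks is meaningful (here it recovers A exactly).
W1_avg, W2_avg = 0.5 * (W1_a + W1_b_aligned), 0.5 * (W2_a + W2_b_aligned)
print("true permutation recovered:", np.allclose(W1_a, W1_b_aligned))
```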
arXiv Detail & Related papers (2022-09-15T17:52:54Z)
- Adaptive Self-supervision Algorithms for Physics-informed Neural Networks [59.822151945132525]
Physics-informed neural networks (PINNs) incorporate physical knowledge from the problem domain as a soft constraint on the loss function.
We study the impact of the location of the collocation points on the trainability of these models.
We propose a novel adaptive collocation scheme which progressively allocates more collocation points to areas where the model is making higher errors.
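A minimal sketch of residual-proportional resampling, with a toy residual function standing in for the PINN's PDE residual (the allocation rule and numbers are illustrative assumptions, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)

def residual(x):
    # Toy stand-in for the PINN's PDE residual |N[u_theta](x)|; largest near x = 0.8.
    return np.abs(np.sin(6.0 * x)) * np.exp(-10.0 * (x - 0.8) ** 2) + 1e-3

collocation = rng.uniform(0.0, 1.0, size=64)     # initial uniform collocation points

for adaptation_round in range(5):
    # ... train the PINN on the current collocation points here ...
    # Evaluate the residual on a dense candidate pool and sample new points with
    # probability proportional to the residual, so high-error regions get more points.
    pool = rng.uniform(0.0, 1.0, size=2048)
    r = residual(pool)
    new_points = rng.choice(pool, size=32, replace=False, p=r / r.sum())
    collocation = np.concatenate([collocation, new_points])

print(f"fraction of points in the high-residual region x > 0.6: {(collocation > 0.6).mean():.2f}")
```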
arXiv Detail & Related papers (2022-07-08T18:17:06Z)
- On the Convergence to a Global Solution of Shuffling-Type Gradient Algorithms [18.663264755108703]
The stochastic gradient descent (SGD) algorithm is the method of choice in many machine learning tasks.
In this paper, we show that shuffling-type SGD achieves the desired computational complexity, matching that of the convex setting.
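For reference, a minimal sketch of shuffling-type (random-reshuffling) SGD on a least-squares problem; the data and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.01 * rng.normal(size=n)

w, lr = np.zeros(d), 0.01
for epoch in range(20):
    order = rng.permutation(n)              # reshuffle once per epoch, then pass over every example
    for i in order:
        grad_i = (X[i] @ w - y[i]) * X[i]   # gradient of the i-th squared-error term
        w = w - lr * grad_i

print("training loss:", 0.5 * np.mean((X @ w - y) ** 2))
```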
arXiv Detail & Related papers (2022-06-13T01:25:59Z)
- Message Passing Neural PDE Solvers [60.77761603258397]
We build a neural message passing solver, replacing all heuristically designed components in the computation graph with backprop-optimized neural function approximators.
We show that neural message passing solvers representationally contain some classical methods, such as finite differences, finite volumes, and WENO schemes.
We validate our method on various fluid-like flow problems, demonstrating fast, stable, and accurate performance across different domain topologies, equation parameters, discretizations, etc., in 1D and 2D.
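A toy sketch of a single message-passing step on a 1D grid, with random weights standing in for the backprop-optimized message and update networks (an illustrative structure, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, hidden = 32, 16

# Node states: the local solution value u_i plus a learned hidden embedding.
u = np.sin(np.linspace(0.0, 2.0 * np.pi, n_nodes))
h = rng.normal(size=(n_nodes, hidden)) * 0.1

# Edges of a 1D grid graph (each node talks to its left/right neighbour).
src = np.concatenate([np.arange(n_nodes - 1), np.arange(1, n_nodes)])
dst = np.concatenate([np.arange(1, n_nodes), np.arange(n_nodes - 1)])

# Random weights stand in for the learned message and update networks.
W_msg = rng.normal(size=(2 * hidden + 1, hidden)) * 0.1
W_upd = rng.normal(size=(2 * hidden, hidden)) * 0.1

# Message step: each edge builds a message from both endpoints and the local difference u_j - u_i.
edge_in = np.concatenate([h[src], h[dst], (u[dst] - u[src])[:, None]], axis=1)
messages = np.tanh(edge_in @ W_msg)

# Aggregation: sum incoming messages per node, then update the node state.
agg = np.zeros((n_nodes, hidden))
np.add.at(agg, dst, messages)
h_new = np.tanh(np.concatenate([h, agg], axis=1) @ W_upd)

print("updated node embeddings:", h_new.shape)
```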
arXiv Detail & Related papers (2022-02-07T17:47:46Z)
- Lyapunov Exponents for Diversity in Differentiable Games [19.16909724435523]
Ridge Rider (RR) is an algorithm for finding diverse solutions to optimization problems by following eigenvectors of the Hessian ("ridges").
RR is designed for conservative gradient systems, where it branches at saddles - easy-to-find bifurcation points.
We propose a method - denoted Generalized Ridge Rider (GRR) - for finding arbitrary bifurcation points.
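Methods in this family need leading eigenvectors of the Hessian without materializing the full matrix; a standard way to get one is power iteration on Hessian-vector products, sketched here with a finite-difference HVP and an illustrative quadratic loss:

```python
import numpy as np

def loss_grad(w):
    # Gradient of an illustrative quadratic loss 0.5 * w^T A w.
    A = np.diag([5.0, 1.0, 0.2])
    return A @ w

def hvp(w, v, eps=1e-4):
    # Hessian-vector product via finite differences of the gradient:
    # H v ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps).
    return (loss_grad(w + eps * v) - loss_grad(w - eps * v)) / (2.0 * eps)

def top_eigenvector(w, iters=100, seed=0):
    # Power iteration: repeatedly apply the Hessian to a vector and renormalise.
    v = np.random.default_rng(seed).normal(size=w.shape)
    for _ in range(iters):
        v = hvp(w, v)
        v /= np.linalg.norm(v)
    eigval = v @ hvp(w, v)      # Rayleigh quotient of the converged direction
    return eigval, v

print(top_eigenvector(np.ones(3)))  # converges to the eigenvalue-5 direction for this toy loss
```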
arXiv Detail & Related papers (2021-12-24T22:48:14Z)
- Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate [105.62979485062756]
This paper attempts to characterize the particular regularization effect of SGD in the moderate learning rate regime.
We show that SGD converges along the large eigenvalue directions of the data matrix, while GD goes after the small eigenvalue directions.
arXiv Detail & Related papers (2020-11-04T21:07:52Z)
- Physarum Powered Differentiable Linear Programming Layers and Applications [48.77235931652611]
We propose an efficient and differentiable solver for general linear programming problems.
We show the use of our solver in a video segmentation task and meta-learning for few-shot learning.
arXiv Detail & Related papers (2020-04-30T01:50:37Z)