Hidden Minima in Two-Layer ReLU Networks
- URL: http://arxiv.org/abs/2312.16819v2
- Date: Mon, 19 Feb 2024 17:33:41 GMT
- Title: Hidden Minima in Two-Layer ReLU Networks
- Authors: Yossi Arjevani
- Abstract summary: Two types of infinite families of spurious minima, giving one minimum per $d$, were recently found.
The loss at minima belonging to the first type converges to zero as $d$ increases.
In the second type, the loss remains bounded away from zero.
- Score: 7.23389716633927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The optimization problem associated with fitting two-layer ReLU
networks having $d$~inputs, $k$~neurons, and labels generated by a target
network is considered. Two types of infinite families of spurious minima,
giving one minimum per $d$, were recently found. The loss at minima belonging
to the first type converges to zero as $d$ increases. In the second type, the
loss remains bounded away from zero. That being so, how may one avoid minima
belonging to the latter type? Fortunately, such minima are never detected by
standard optimization methods. Motivated by questions concerning the nature of
this phenomenon, we develop methods to study distinctive analytic properties of
hidden minima.
By existing analyses, the Hessian spectra of the two types agree modulo
$O(d^{-1/2})$ terms, which is not promising. Our investigation therefore
proceeds instead by studying curves along which the loss is minimized or
maximized, generally referred to as tangency arcs. We prove that apparently
far-removed group representation-theoretic considerations, concerning how
subspaces invariant under the action of subgroups of $S_d$, the symmetry group
on $d$ symbols, are arranged relative to subspaces fixed by the action, yield a
precise description of all finitely many admissible types of tangency arcs.
Applied to the loss function, these general results reveal that arcs emanating
from hidden minima differ, characteristically, in their structure and symmetry,
precisely on account of the $O(d^{-1/2})$ eigenvalue terms absent in previous
work, indicating in particular the subtlety of the analysis. The theoretical
results, stated and proved for o-minimal structures, show that the set
comprising all tangency arcs is topologically sufficiently tame to enable a
numerical construction of tangency arcs, and so to compare how minima of both
types are positioned relative to adjacent critical points.
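As a concrete point of entry, the sketch below is a minimal numerical illustration, not the paper's method: it evaluates the population loss of a two-layer ReLU student against a target network under standard Gaussian inputs, using the standard closed form $E[\mathrm{relu}(u^\top x)\,\mathrm{relu}(v^\top x)] = \|u\|\|v\|(\sin\theta + (\pi-\theta)\cos\theta)/(2\pi)$, and estimates the Hessian spectrum at a given point by finite differences. The orthonormal target network and the perturbed initialization are illustrative assumptions.

```python
import numpy as np

def relu_kernel(u, v):
    """E[relu(u.x) relu(v.x)] for x ~ N(0, I_d), via the standard closed form."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    cos_t = np.clip(u @ v / (nu * nv), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nu * nv * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)

def population_loss(W, V):
    """L(W) = 1/2 E[(sum_i relu(w_i.x) - sum_j relu(v_j.x))^2], unit outer weights."""
    total = 0.0
    for A, B, sign in [(W, W, 1.0), (V, V, 1.0), (W, V, -2.0)]:
        for a in A:
            for b in B:
                total += sign * relu_kernel(a, b)
    return 0.5 * total

def hessian_eigenvalues(W, V, eps=1e-4):
    """Finite-difference Hessian spectrum of the loss in the flattened weights."""
    shape, w0 = W.shape, W.ravel().astype(float)
    n = w0.size
    f = lambda w: population_loss(w.reshape(shape), V)
    E = np.eye(n) * eps
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = (f(w0 + E[i] + E[j]) - f(w0 + E[i] - E[j])
                       - f(w0 - E[i] + E[j]) + f(w0 - E[i] - E[j])) / (4 * eps**2)
    return np.linalg.eigvalsh(H)

d = k = 4
V = np.eye(d)[:k]  # orthonormal target neurons (an illustrative choice)
W = V + 0.1 * np.random.default_rng(0).standard_normal((k, d))
print(population_loss(W, V))
print(hessian_eigenvalues(W, V)[:5])  # smallest eigenvalues at this point
```

Scanning such spectra along curves through weight space is the informal counterpart of the tangency-arc construction developed in the paper.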
Related papers
- A Mean-Field Analysis of Neural Stochastic Gradient Descent-Ascent for Functional Minimax Optimization [90.87444114491116]
This paper studies minimax optimization problems defined over infinite-dimensional function classes of overparametrized two-layer neural networks.
We address (i) the convergence of the gradient descent-ascent algorithm and (ii) the representation learning of the neural networks.
Results show that the feature representation induced by the neural networks is allowed to deviate from its initialization by a magnitude of $O(\alpha^{-1})$, measured in the Wasserstein distance.
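As a minimal illustration of gradient descent-ascent itself, the toy below runs simultaneous GDA on a regularized bilinear game, where convergence to the saddle is guaranteed; the paper's mean-field, functional setting with overparametrized networks is not reproduced here.

```python
import numpy as np

# Simultaneous gradient descent-ascent (GDA) on
#   f(x, y) = x.T A y + (lam/2)||x||^2 - (lam/2)||y||^2,
# a strongly-convex-strongly-concave toy, so plain GDA converges.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)
eta, lam = 0.05, 1.0
for _ in range(2000):
    gx = A @ y + lam * x        # df/dx
    gy = A.T @ x - lam * y      # df/dy
    x, y = x - eta * gx, y + eta * gy   # descent in x, ascent in y
print(np.linalg.norm(x), np.linalg.norm(y))  # both approach the saddle at 0
```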
arXiv Detail & Related papers (2024-04-18T16:46:08Z)
- Adam-like Algorithm with Smooth Clipping Attains Global Minima: Analysis Based on Ergodicity of Functional SDEs [0.0]
We show that an Adam-type algorithm with smooth clipping attains the global minimizer of the regularized, non-convex loss function.
We also apply ergodic theory to investigate approaches to learning the inverse temperature and time.
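The following is a minimal sketch of an Adam-type update with a smooth (tanh-based) clip; the particular clipping function, the test loss, and all constants are assumptions for illustration, and the paper's SDE and ergodicity ingredients are not reproduced.

```python
import numpy as np

def adam_smooth_clip(grad_fn, w0, steps=1000, lr=1e-2, b1=0.9, b2=0.999,
                     eps=1e-8, clip_scale=1.0):
    """Adam with a smooth (tanh-based) gradient clip, an illustrative
    stand-in for the paper's smooth clipping."""
    w = w0.copy()
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        g = clip_scale * np.tanh(g / clip_scale)  # smooth, bounded surrogate for hard clipping
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat, vhat = m / (1 - b1**t), v / (1 - b2**t)
        w -= lr * mhat / (np.sqrt(vhat) + eps)
    return w

# Example: a non-convex 1-D loss with two basins.
loss_grad = lambda w: 4*w**3 - 4*w + 0.5   # grad of w^4 - 2w^2 + 0.5w
print(adam_smooth_clip(loss_grad, np.array([2.0])))
```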
arXiv Detail & Related papers (2023-11-29T14:38:59Z)
- Detection-Recovery Gap for Planted Dense Cycles [72.4451045270967]
We consider a model where a dense cycle with expected bandwidth $n\tau$ and edge density $p$ is planted in an Erdős-Rényi graph $G(n,q)$.
We characterize the computational thresholds for the associated detection and recovery problems for the class of low-degree algorithms.
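A plausible sampler for the planted model as summarized above; the cyclic-distance rule and the parameter names are assumptions made for illustration.

```python
import numpy as np

def planted_dense_cycle(n, tau, p, q, seed=0):
    """Plant a dense cycle of bandwidth ~ n*tau in G(n, q): vertices get a
    hidden cyclic order; pairs within cyclic distance n*tau/2 connect with
    probability p, all other pairs with probability q."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)            # hidden cyclic order of the vertices
    pos = np.empty(n, dtype=int)
    pos[order] = np.arange(n)
    A = np.zeros((n, n), dtype=int)
    half_bw = n * tau / 2
    for i in range(n):
        for j in range(i + 1, n):
            dist = abs(pos[i] - pos[j])
            dist = min(dist, n - dist)    # cyclic distance
            prob = p if dist <= half_bw else q
            A[i, j] = A[j, i] = rng.random() < prob
    return A, order

A, hidden = planted_dense_cycle(n=200, tau=0.05, p=0.5, q=0.05)
print(A.sum() // 2, "edges")
```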
arXiv Detail & Related papers (2023-02-13T22:51:07Z)
- Causal Bandits for Linear Structural Equation Models [58.2875460517691]
This paper studies the problem of designing an optimal sequence of interventions in a causal graphical model.
It is assumed that the graph's structure is known and has $N$ nodes.
Two algorithms are proposed for the frequentist (UCB-based) and Bayesian settings.
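For reference, a sketch of plain UCB1 over a finite set of interventions; the paper's algorithms additionally exploit the known causal graph structure, which this generic version ignores.

```python
import numpy as np

def ucb_interventions(reward_fns, horizon=5000, seed=0):
    """Generic UCB1 over a finite set of interventions (arms)."""
    rng = np.random.default_rng(seed)
    K = len(reward_fns)
    counts, sums = np.zeros(K), np.zeros(K)
    for a in range(K):                        # pull each arm once
        sums[a] += reward_fns[a](rng); counts[a] += 1
    for t in range(K, horizon):
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        a = int(np.argmax(ucb))               # optimistic arm choice
        sums[a] += reward_fns[a](rng); counts[a] += 1
    return counts

arms = [lambda r, m=m: r.normal(m, 1.0) for m in (0.1, 0.5, 0.3)]
print(ucb_interventions(arms))  # most pulls should go to the mean-0.5 arm
```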
arXiv Detail & Related papers (2022-08-26T16:21:31Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Origins of Low-dimensional Adversarial Perturbations [17.17170592140042]
We study the phenomenon of low-dimensional adversarial perturbations in classification.
The goal is to fool the classifier into flipping its decision on a nonzero fraction of inputs from a designated class.
We compute lower bounds for the fooling rate of any subspace.
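A small sketch of the setting for a linear classifier: restrict the perturbation to a fixed low-dimensional subspace and measure the fraction of inputs whose predicted label flips. The subspace, budget, and Gaussian data model are illustrative assumptions.

```python
import numpy as np

# Subspace attack on a linear classifier sign(w.x): find a perturbation delta
# restricted to span(S) that flips predictions on +1-class inputs.
rng = np.random.default_rng(0)
d, m = 50, 3
w = rng.standard_normal(d); w /= np.linalg.norm(w)
S = np.linalg.qr(rng.standard_normal((d, m)))[0]  # orthonormal basis, shape (d, m)
X = rng.standard_normal((1000, d)) + w            # inputs centered on the +1 side
X = X[X @ w > 0]                                  # keep correctly classified inputs
delta = -S @ (S.T @ w)                            # projection of -w onto span(S)
delta *= 3.0 / np.linalg.norm(delta)              # spend the perturbation budget
fooling_rate = np.mean((X + delta) @ w < 0)
print(f"fooling rate: {fooling_rate:.2f}")
```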
arXiv Detail & Related papers (2022-03-25T17:02:49Z)
- Analytic Study of Families of Spurious Minima in Two-Layer ReLU Neural Networks [15.711517003382484]
We show that the Hessian spectrum concentrates near positive values, with the exception of $\Theta(d)$ eigenvalues which grow linearly with $d$.
This makes it possible to study the creation and annihilation of minima using powerful tools from bifurcation theory.
arXiv Detail & Related papers (2021-07-21T22:05:48Z)
- Lattice partition recovery with dyadic CART [79.96359947166592]
We study piece-wise constant signals corrupted by additive Gaussian noise over a $d$-dimensional lattice.
Data of this form naturally arise in a host of applications, and the tasks of signal detection or testing, de-noising and estimation have been studied extensively in the statistical and signal processing literature.
In this paper we consider instead the problem of partition recovery, i.e., of estimating the partition of the lattice induced by the constancy regions of the unknown signal.
We prove that a DCART-based procedure consistently estimates the underlying partition at a rate of order $\sigma^2 k^*$
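A one-dimensional toy of the dyadic-partition idea; the paper works over $d$-dimensional lattices, and this sketch makes no claim to match the DCART estimator's exact penalty.

```python
import numpy as np

def dyadic_cart_1d(y, lam):
    """Penalized dyadic-partition fit: recursively decide whether splitting a
    dyadic interval in half lowers squared error + lam * (#pieces)."""
    def fit(lo, hi):
        seg = y[lo:hi]
        cost_leaf = np.sum((seg - seg.mean())**2) + lam
        if hi - lo < 2:
            return cost_leaf, [(lo, hi)]
        mid = (lo + hi) // 2
        cl, pl = fit(lo, mid)
        cr, pr = fit(mid, hi)
        if cl + cr < cost_leaf:
            return cl + cr, pl + pr
        return cost_leaf, [(lo, hi)]
    return fit(0, len(y))[1]

rng = np.random.default_rng(0)
signal = np.repeat([0.0, 2.0, -1.0, 1.0], 64)      # piecewise constant truth
y = signal + 0.3 * rng.standard_normal(signal.size)
print(dyadic_cart_1d(y, lam=2.0))                  # recovered constancy intervals
```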
arXiv Detail & Related papers (2021-05-27T23:41:01Z)
- Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins [92.7662890047311]
We show that gradient descent finds halfspaces with classification error $\tilde{O}(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d,1/\varepsilon)$ time and sample complexity.
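A minimal sketch of the approach in spirit: gradient descent on a convex surrogate (logistic loss here, as an assumption; the paper's analysis proceeds via a soft-margin argument not reproduced below) for halfspace data with adversarial label noise.

```python
import numpy as np

# Gradient descent on the logistic loss for a noisy halfspace.
rng = np.random.default_rng(0)
d, n = 20, 5000
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_star)
flip = rng.random(n) < 0.05            # 5% agnostic label noise
y[flip] *= -1

w = np.zeros(d)
for _ in range(500):
    margins = y * (X @ w)
    # grad of mean log(1 + exp(-margin)) w.r.t. w
    grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad
err = np.mean(np.sign(X @ w) != y)
print(f"classification error: {err:.3f}")  # near the 5% noise floor
```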
arXiv Detail & Related papers (2020-10-01T16:48:33Z)
- Analytic Characterization of the Hessian in Shallow ReLU Models: A Tale of Symmetry [9.695960412426672]
We analytically characterize the Hessian at various families of spurious minima.
In particular, we prove that for $d \ge k$ standard Gaussian inputs: (a) of the $dk$ eigenvalues of the Hessian, $dk - O(d)$ concentrate near zero, (b) $\Omega(d)$ of the eigenvalues grow linearly with $k$.
arXiv Detail & Related papers (2020-08-04T20:08:35Z)
- Symmetry & critical points for a model shallow neural network [9.695960412426672]
We consider the optimization problem associated with fitting two-layer ReLU networks with $k$ hidden neurons.
We leverage the rich symmetry exhibited by such models to identify various families of critical points.
We show that while the loss function at certain types of spurious minima decays to zero like $k^{-1}$, in other cases the loss converges to a strictly positive constant.
arXiv Detail & Related papers (2020-03-23T23:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.