Related papers: Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

URL: http://arxiv.org/abs/2309.10370v2
Date: Sun, 17 Mar 2024 08:09:45 GMT
Title: Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization
Authors: Thomas Chen, Patricia Muñoz Ewald,
Abstract summary: We consider shallow neural networks with one hidden layer, a ReLU activation function, an $mathcal L2$ Schatten class (or Hilbert-Schmidt) cost function. We prove an upper bound on the minimum of the cost function of order $O(delta_P)$ where $delta_P$ measures the signal to noise ratio of training inputs. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Qleq M$ by a
Score: 1.189367612437469
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow neural networks through the explicit construction of upper bounds, without any use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider shallow neural networks with one hidden layer, a ReLU activation function, an ${\mathcal L}^2$ Schatten class (or Hilbert-Schmidt) cost function, input space ${\mathbb R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size $N>QM$ that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(\delta_P)$ where $\delta_P$ measures the signal to noise ratio of training inputs. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(\delta_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

Related papers

Surrogate to Poincaré inequalities on manifolds for dimension reduction in nonlinear feature spaces [49.1574468325115]
We aim to approximate a continuously differentiable function $u:mathbbRd rightarrow mathbbRm$ by a composition of functions $fcirc g$ where $g:mathbbRd rightarrow mathbbRm$, $mleq d$, and $f : mathbbRm rightarrow mathbbRR$.<n>For a fixed $g$, we build $f$ using classical regression methods, involving evaluations
arXiv Detail & Related papers (2025-05-03T12:37:27Z)
Guessing Efficiently for Constrained Subspace Approximation [49.83981776254246]
We introduce a general framework for constrained subspace approximation. We show it provides new algorithms for partition-constrained subspace approximation with applications to $k$-means clustering, and projected non-negative matrix factorization.
arXiv Detail & Related papers (2025-04-29T15:56:48Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch. We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work. We also show an $Omega(k2/pn1-2/p)$ lower bound for the sparse recovery problem, which is tight up to a $mathrmpoly(log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z)
Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$ under isotropic Gaussian data. We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot d
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
Provably learning a multi-head attention layer [55.2904547651831]
Multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. In this work, we initiate the study of provably learning a multi-head attention layer from random examples. We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z)
Learning Hierarchical Polynomials with Three-Layer Neural Networks [56.71223169861528]
We study the problem of learning hierarchical functions over the standard Gaussian distribution with three-layer neural networks. For a large subclass of degree $k$s $p$, a three-layer neural network trained via layerwise gradientp descent on the square loss learns the target $h$ up to vanishing test error. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
arXiv Detail & Related papers (2023-11-23T02:19:32Z)
$O(k)$-Equivariant Dimensionality Reduction on Stiefel Manifolds [2.0818404738530525]
Many real-world datasets live on high-dimensional Stiefel and Grassmannian manifold, $V_k(mathbbRN)$ and $Gr(k, mathbbRN)$ respectively.<n>We propose an algorithm called textitPrincipal Stiefel Coordinates (PSC) to reduce data dimensionality from $ V_k(mathbbRN)$ to $V_k(mathbbRn)$ in an textit$O(k)$-equivariant manner
arXiv Detail & Related papers (2023-09-19T17:21:12Z)
Geometric structure of Deep Learning networks and construction of global ${\mathcal L}^2$ minimizers [1.189367612437469]
We explicitly determine local and global minimizers of the $mathcalL2$ cost function in underparametrized Deep Learning (DL) networks.
arXiv Detail & Related papers (2023-09-19T14:20:55Z)
Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss. Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z)
Learning linear dynamical systems under convex constraints [4.4351901934764975]
We consider the problem of identification of linear dynamical systems from $T$ samples of a single trajectory. $A*$ can be reliably estimated for values $T$ smaller than what is needed for unconstrained setting.
arXiv Detail & Related papers (2023-03-27T11:49:40Z)
Estimating the minimizer and the minimum value of a regression function under passive design [72.85024381807466]
We propose a new method for estimating the minimizer $boldsymbolx*$ and the minimum value $f*$ of a smooth and strongly convex regression function $f$. We derive non-asymptotic upper bounds for the quadratic risk and optimization error of $boldsymbolz_n$, and for the risk of estimating $f*$.
arXiv Detail & Related papers (2022-11-29T18:38:40Z)
Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces [2.7195102129095003]
Deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $Ws(L_q(Omega))$ and Besov spaces $Bs_r(L_q(Omega))$. This problem is important when studying the application of neural networks in a variety of fields.
arXiv Detail & Related papers (2022-11-25T23:32:26Z)
Learning (Very) Simple Generative Models Is Hard [45.13248517769758]
We show that no-time algorithm can solve problem even when output coordinates of $mathbbRdtobbRd'$ are one-hidden-layer ReLU networks with $mathrmpoly(d)$ neurons. Key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with neurally-bounded slopes such that the pushforward of $mathcalN(0,1)$ under $f$ matches all low-degree moments of $mathcal
arXiv Detail & Related papers (2022-05-31T17:59:09Z)
On minimal representations of shallow ReLU networks [0.0]
We show that the minimal representation for $f$ uses either $n$, $n+1$ or $n+2$ neurons. In particular, where the input layer is one-dimensional, minimal representations always use at most $n+1$ neurons but in all higher dimensional settings there are functions for which $n+2$ neurons are needed.
arXiv Detail & Related papers (2021-08-12T10:22:24Z)
Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances [9.390008801320024]
We show that adding one extra neuron to each is sufficient to connect all previously discrete minima into a single manifold. We show that the number of symmetry-induced critical subspaces dominates the number of affine subspaces forming the global minima manifold.
arXiv Detail & Related papers (2021-05-25T21:19:07Z)
Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss. For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$. For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.