Related papers: Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks

Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks

URL: http://arxiv.org/abs/2506.19695v1
Date: Tue, 24 Jun 2025 15:02:16 GMT
Title: Near-optimal estimates for the $\ell^p$-Lipschitz constants of deep random ReLU neural networks
Authors: Sjoerd Dirksen, Patrick Finke, Paul Geuchen, Dominik Stöger, Felix Voigtlaender,
Abstract summary: We derive high probability upper and lower bounds for wide networks that differ at most by a factor that is logarithmic in the network's width and linear in its depth.<n>Remarkably, the behavior of the $ellp$-Lipschitz constant varies significantly between the regimes $ p in [1,2) $ and $ p in [2,infty] $.
Score: 3.684988521329369
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper studies the $\ell^p$-Lipschitz constants of ReLU neural networks $\Phi: \mathbb{R}^d \to \mathbb{R}$ with random parameters for $p \in [1,\infty]$. The distribution of the weights follows a variant of the He initialization and the biases are drawn from symmetric distributions. We derive high probability upper and lower bounds for wide networks that differ at most by a factor that is logarithmic in the network's width and linear in its depth. In the special case of shallow networks, we obtain matching bounds. Remarkably, the behavior of the $\ell^p$-Lipschitz constant varies significantly between the regimes $ p \in [1,2) $ and $ p \in [2,\infty] $. For $p \in [2,\infty]$, the $\ell^p$-Lipschitz constant behaves similarly to $\Vert g\Vert_{p'}$, where $g \in \mathbb{R}^d$ is a $d$-dimensional standard Gaussian vector and $1/p + 1/p' = 1$. In contrast, for $p \in [1,2)$, the $\ell^p$-Lipschitz constant aligns more closely to $\Vert g \Vert_{2}$.

Related papers

Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$<n>We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
arXiv Detail & Related papers (2024-06-03T17:56:58Z)
A Unified Framework for Uniform Signal Recovery in Nonlinear Generative Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $mathbfx*$ rather than for all $mathbfx*$ simultaneously. Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples. We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z)
$\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $ell_p$-regression problem in the coordinator model. For $p = 2$, i.e., least squares regression, we give the first optimal bound of $tildeTheta(sd2 + sd/epsilon)$ bits. For $p in (1,2)$,we obtain an $tildeO(sd2/epsilon + sd/mathrmpoly(epsilon)$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z)
Optimal Sketching Bounds for Sparse Linear Regression [116.30196615349226]
We study oblivious sketching for $k$-sparse linear regression under various loss functions such as an $ell_p$ norm, or from a broad class of hinge-like loss functions. We show that for sparse $ell$vareps regression, there is an oblivious distribution over sketches with $Theta(klog(d/k)/varepsilon2)$ rows, which is tight up to a constant factor. We also show that sketching $O(mu2 klog(mu n d/varepsilon)/var
arXiv Detail & Related papers (2023-04-05T07:24:19Z)
A spectral least-squares-type method for heavy-tailed corrupted regression with unknown covariance \& heterogeneous noise [2.019622939313173]
We revisit heavy-tailed corrupted least-squares linear regression assuming to have a corrupted $n$-sized label-feature sample of at most $epsilon n$ arbitrary outliers. We propose a near-optimal computationally tractable estimator, based on the power method, assuming no knowledge on $(Sigma,Xi) nor the operator norm of $Xi$.
arXiv Detail & Related papers (2022-09-06T23:37:31Z)
Near-Linear Time and Fixed-Parameter Tractable Algorithms for Tensor Decompositions [51.19236668224547]
We study low rank approximation of tensors, focusing on the tensor train and Tucker decompositions. For tensor train decomposition, we give a bicriteria $(1 + eps)$-approximation algorithm with a small bicriteria rank and $O(q cdot nnz(A))$ running time. In addition, we extend our algorithm to tensor networks with arbitrary graphs.
arXiv Detail & Related papers (2022-07-15T11:55:09Z)
Low-degree learning and the metric entropy of polynomials [44.99833362998488]
We prove that any (deterministic or randomized) algorithm which learns $mathscrF_nd$ with $L$-accuracy $varepsilon$ requires at least $Omega(sqrtvarepsilon)2dlog n leq log mathsfM(mathscrF_n,d,|cdot|_L,varepsilon) satisfies the two-sided estimate $$c (1-varepsilon)2dlog
arXiv Detail & Related papers (2022-03-17T23:52:08Z)
ReLU Network Approximation in Terms of Intrinsic Parameters [5.37133760455631]
We study the approximation error of ReLU networks in terms of the number of intrinsic parameters. We design a ReLU network with only three intrinsic parameters to approximate H"older continuous functions with an arbitrarily small error.
arXiv Detail & Related papers (2021-11-15T18:20:38Z)
Ridgeless Interpolation with Shallow ReLU Networks in $1D$ is Nearest Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz Functions [16.75218291152252]
We prove a precise geometric description of all one layer ReLU networks $z(x;theta)$ with a single linear unit. We refer to them as ridgeless ReLU interpolants. Our results show that ridgeless ReLU interpolants achieve the best possible generalization for learning $1d$ Lipschitz functions.
arXiv Detail & Related papers (2021-09-27T11:32:24Z)
Deep Neural Networks with ReLU-Sine-Exponential Activations Break Curse of Dimensionality on H\"older Class [6.476766717110237]
We construct neural networks with ReLU, sine and $2x$ as activation functions. In addition to its supper expressive power, functions implemented by ReLU-sine-$2x$ networks are (generalized) differentiable.
arXiv Detail & Related papers (2021-02-28T15:57:42Z)
Optimal Approximation Rate of ReLU Networks in terms of Width and Depth [5.37133760455631]
This paper concentrates on the approximation power of deep feed-forward neural networks in terms of width and depth. It is proved that ReLU networks with width $mathcalObig(maxdlfloor N1/drfloor,, N+2big)$ and depth $mathcalO(L)$ can approximate a H"older continuous function on $[0,1]d$ with an approximation rate $mathcalObig(lambdasqrtd (N2L2ln
arXiv Detail & Related papers (2021-02-28T13:15:55Z)
Sparse sketches with small inversion bias [79.77110958547695]
Inversion bias arises when averaging estimates of quantities that depend on the inverse covariance. We develop a framework for analyzing inversion bias, based on our proposed concept of an $(epsilon,delta)$-unbiased estimator for random matrices. We show that when the sketching matrix $S$ is dense and has i.i.d. sub-gaussian entries, the estimator $(epsilon,delta)$-unbiased for $(Atop A)-1$ with a sketch of size $m=O(d+sqrt d/
arXiv Detail & Related papers (2020-11-21T01:33:15Z)
Deep Network with Approximation Error Being Reciprocal of Width to Power of Square Root of Depth [4.468952886990851]
A new network with super approximation power is introduced. This network is built with Floor ($lfloor xrfloor$) or ReLU ($max0,x$) activation function in each neuron.
arXiv Detail & Related papers (2020-06-22T13:27:33Z)
Curse of Dimensionality on Randomized Smoothing for Certifiable Robustness [151.67113334248464]
We show that extending the smoothing technique to defend against other attack models can be challenging. We present experimental results on CIFAR to validate our theory.
arXiv Detail & Related papers (2020-02-08T22:02:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.