LU decomposition and Toeplitz decomposition of a neural network
- URL: http://arxiv.org/abs/2211.13935v1
- Date: Fri, 25 Nov 2022 07:26:39 GMT
- Title: LU decomposition and Toeplitz decomposition of a neural network
- Authors: Yucong Liu, Simiao Jiao, and Lek-Heng Lim
- Abstract summary: We show that any continuous function $f : mathbbRn to mathbbRm$ has an approximation to arbitrary accuracy by a neural network.
A consequence of our Toeplitz result is a fixed-width universal approximation for convolutional neural networks.
- Score: 5.276232626689567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: It is well-known that any matrix $A$ has an LU decomposition. Less well-known
is the fact that it has a 'Toeplitz decomposition' $A = T_1 T_2 \cdots T_r$
where $T_i$'s are Toeplitz matrices. We will prove that any continuous function
$f : \mathbb{R}^n \to \mathbb{R}^m$ has an approximation to arbitrary accuracy
by a neural network that takes the form $L_1 \sigma_1 U_1 \sigma_2 L_2 \sigma_3
U_2 \cdots L_r \sigma_{2r-1} U_r$, i.e., where the weight matrices alternate
between lower and upper triangular matrices, $\sigma_i(x) := \sigma(x - b_i)$
for some bias vector $b_i$, and the activation $\sigma$ may be chosen to be
essentially any uniformly continuous nonpolynomial function. The same result
also holds with Toeplitz matrices, i.e., $f \approx T_1 \sigma_1 T_2 \sigma_2
\cdots \sigma_{r-1} T_r$ to arbitrary accuracy, and likewise for Hankel
matrices. A consequence of our Toeplitz result is a fixed-width universal
approximation theorem for convolutional neural networks, which so far have only
arbitrary width versions. Since our results apply in particular to the case
when $f$ is a general neural network, we may regard them as LU and Toeplitz
decompositions of a neural network. The practical implication of our results is
that one may vastly reduce the number of weight parameters in a neural network
without sacrificing its power of universal approximation. We will present
several experiments on real data sets to show that imposing such structures on
the weight matrices sharply reduces the number of training parameters with
almost no noticeable effect on test accuracy.
Related papers
- Sample and Computationally Efficient Robust Learning of Gaussian Single-Index Models [37.42736399673992]
A single-index model (SIM) is a function of the form $sigma(mathbfwast cdot mathbfx)$, where $sigma: mathbbR to mathbbR$ is a known link function and $mathbfwast$ is a hidden unit vector.
We show that a proper learner attains $L2$-error of $O(mathrmOPT)+epsilon$, where $
arXiv Detail & Related papers (2024-11-08T17:10:38Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot d
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - A Unified Framework for Uniform Signal Recovery in Nonlinear Generative
Compressed Sensing [68.80803866919123]
Under nonlinear measurements, most prior results are non-uniform, i.e., they hold with high probability for a fixed $mathbfx*$ rather than for all $mathbfx*$ simultaneously.
Our framework accommodates GCS with 1-bit/uniformly quantized observations and single index models as canonical examples.
We also develop a concentration inequality that produces tighter bounds for product processes whose index sets have low metric entropy.
arXiv Detail & Related papers (2023-09-25T17:54:19Z) - Neural Networks Efficiently Learn Low-Dimensional Representations with
SGD [22.703825902761405]
We show that SGD-trained ReLU NNs can learn a single-index target of the form $y=f(langleboldsymbolu,boldsymbolxrangle) + epsilon$ by recovering the principal direction.
We also provide compress guarantees for NNs using the approximate low-rank structure produced by SGD.
arXiv Detail & Related papers (2022-09-29T15:29:10Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $mathbfxmapstosigma(mathbfwcdotmathbfx)$ for monotone activations.
The goal of the learner is to output a hypothesis vector $mathbfw$ that $F(mathbbw)=C, epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - A Law of Robustness beyond Isoperimetry [84.33752026418045]
We prove a Lipschitzness lower bound $Omega(sqrtn/p)$ of robustness of interpolating neural network parameters on arbitrary distributions.
We then show the potential benefit of overparametrization for smooth data when $n=mathrmpoly(d)$.
We disprove the potential existence of an $O(1)$-Lipschitz robust interpolating function when $n=exp(omega(d))$.
arXiv Detail & Related papers (2022-02-23T16:10:23Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $bf K$ is independent of the distribution of the i.i.d. entries of $bf w$.
We propose a novel random technique, called Ternary Random Feature (TRF)
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamic of descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network can provably learn with gradient loss at most ground with Tangent samples.
arXiv Detail & Related papers (2020-07-09T07:09:28Z) - A Corrective View of Neural Networks: Representation, Memorization and
Learning [26.87238691716307]
We develop a corrective mechanism for neural network approximation.
We show that two-layer neural networks in the random features regime (RF) can memorize arbitrary labels.
We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions.
arXiv Detail & Related papers (2020-02-01T20:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.