Nonparametric Learning of Two-Layer ReLU Residual Units
- URL: http://arxiv.org/abs/2008.07648v2
- Date: Fri, 11 Jun 2021 17:03:23 GMT
- Title: Nonparametric Learning of Two-Layer ReLU Residual Units
- Authors: Zhunxuan Wang, Linyun He, Chunchuan Lyu and Shay B. Cohen
- Abstract summary: We describe an algorithm that learns two-layer residual units with rectified linear unit (ReLU) activation.
We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities.
We prove the statistical strong consistency of our algorithm, and demonstrate its robustness and sample efficiency through experiments.
- Score: 22.870658194212744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe an algorithm that learns two-layer residual units with rectified
linear unit (ReLU) activation: suppose the input $\mathbf{x}$ is from a
distribution with support space $\mathbb{R}^d$ and the ground-truth generative
model is such a residual unit, given by \[\mathbf{y}=
\boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ +
\mathbf{x}\right]\text{,}\] where the ground-truth network parameter
$\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$ is a nonnegative full-rank
matrix, $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$ is full-rank with
$m \geq d$, and, for $\mathbf{c} \in \mathbb{R}^d$, $[\mathbf{c}^{+}]_i =
\max\{0, c_i\}$. We design layer-wise objectives as functionals whose analytic
minimizers express the exact ground-truth network in terms of its parameters
and nonlinearities. Following this objective landscape, learning residual units
from finite samples can be formulated using convex optimization of a
nonparametric function: for each layer, we first formulate the corresponding
empirical risk minimization (ERM) as a positive semi-definite quadratic program
(QP), then we show the solution space of the QP can be equivalently determined
by a set of linear inequalities, which can then be efficiently solved by linear
programming (LP). We further prove the statistical strong consistency of our
algorithm, and demonstrate its robustness and sample efficiency through
experiments.
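As a concrete illustration of the generative model in the abstract, below is a minimal NumPy sketch (not the paper's algorithm) that draws a ground-truth residual unit $\mathbf{y} = \boldsymbol{B}^\ast[(\boldsymbol{A}^\ast\mathbf{x})^+ + \mathbf{x}]$, samples data from it, and recovers the second layer by ordinary least squares once the hidden representation is given. The dimensions, sampling distributions, and variable names (d, m, A_star, B_star) are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch (assumed setup, not the paper's algorithm): sample data from
# the ground-truth residual unit  y = B* [ (A* x)^+ + x ]  and recover the
# second layer by least squares once the hidden representation is available.
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 8, 2000  # input dim, output dim (m >= d), sample size -- illustrative choices

# Ground-truth parameters as described in the abstract:
# A* is a nonnegative full-rank d x d matrix, B* is a full-rank m x d matrix.
A_star = np.abs(rng.normal(size=(d, d)))
B_star = rng.normal(size=(m, d))
assert np.linalg.matrix_rank(A_star) == d and np.linalg.matrix_rank(B_star) == d

# Inputs from a distribution supported on all of R^d (here: standard Gaussian).
X = rng.normal(size=(n, d))                # shape (n, d)
H = np.maximum(X @ A_star.T, 0.0) + X      # hidden layer (A* x)^+ + x, shape (n, d)
Y = H @ B_star.T                           # outputs y = B* h, shape (n, m)

# Given the first layer (and hence H), the second layer is purely linear,
# so B* is recovered up to numerical error by ordinary least squares.
B_hat = np.linalg.lstsq(H, Y, rcond=None)[0].T
print("second-layer recovery error:", np.linalg.norm(B_hat - B_star))
```

Of course, the hidden representation is not observed in the learning problem itself: per the abstract, the layer-wise procedure first recovers the first layer by casting its ERM as a positive semi-definite QP whose solution set reduces to a system of linear inequalities solvable by LP, after which the second layer becomes a linear problem as above. The sketch only illustrates the generative model and the linearity of the second layer given the first.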
Related papers
- Provable Acceleration of Nesterov's Accelerated Gradient for Rectangular Matrix Factorization and Linear Neural Networks [46.04785603483612]
We prove that Nesterov's accelerated gradient attains a complexity of $O(\kappa \log \frac{1}{\epsilon})$.
In particular, we prove that NAG can also attain an accelerated linear convergence rate.
arXiv Detail & Related papers (2024-10-12T20:33:37Z) - How to Inverting the Leverage Score Distribution? [16.744561210470632]
Despite leverage scores being widely used as a tool, in this paper we study a novel problem, namely inverting the leverage score distribution.
We use iterative shrinking and the induction hypothesis to ensure global convergence rates for the Newton method.
This important study on inverting statistical leverage opens up numerous new applications in interpretation, data recovery, and security.
arXiv Detail & Related papers (2024-04-21T21:36:42Z) - Provably learning a multi-head attention layer [55.2904547651831]
The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models.
In this work, we initiate the study of provably learning a multi-head attention layer from random examples.
We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
arXiv Detail & Related papers (2024-02-06T15:39:09Z) - Learning linear dynamical systems under convex constraints [4.4351901934764975]
We consider the problem of identification of linear dynamical systems from $T$ samples of a single trajectory.
$A^*$ can be reliably estimated for values of $T$ smaller than what is needed in the unconstrained setting.
arXiv Detail & Related papers (2023-03-27T11:49:40Z) - Learning a Single Neuron with Adversarial Label Noise via Gradient
Descent [50.659479930171585]
We study a function of the form $\mathbf{x} \mapsto \sigma(\mathbf{w} \cdot \mathbf{x})$ for monotone activations.
The goal of the learner is to output a hypothesis vector $\mathbf{w}$ such that $F(\mathbf{w}) = C\,\epsilon$ with high probability.
arXiv Detail & Related papers (2022-06-17T17:55:43Z) - Minimax Optimal Quantization of Linear Models: Information-Theoretic
Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z) - Beyond Independent Measurements: General Compressed Sensing with GNN
Application [4.924126492174801]
We consider the problem of recovering a structured signal $\mathbf{x} \in \mathbb{R}^n$ from noisy cone observations.
We show that the effective rank of $\mathbf{B}$ may be used as a surrogate for the number of measurements.
arXiv Detail & Related papers (2021-10-30T20:35:56Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ fewer bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance, and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)