Deep regularization and direct training of the inner layers of Neural
Networks with Kernel Flows
- URL: http://arxiv.org/abs/2002.08335v2
- Date: Fri, 7 Aug 2020 03:47:26 GMT
- Title: Deep regularization and direct training of the inner layers of Neural
Networks with Kernel Flows
- Authors: Gene Ryan Yoo and Houman Owhadi
- Abstract summary: We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs)
KFs were introduced as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of points in random batches of the dataset.
- Score: 0.609170287691728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new regularization method for Artificial Neural Networks
(ANNs) based on Kernel Flows (KFs). KFs were introduced as a method for kernel
selection in regression/kriging based on the minimization of the loss of
accuracy incurred by halving the number of interpolation points in random
batches of the dataset. Writing $f_\theta(x) = \big(f^{(n)}_{\theta_n}\circ
f^{(n-1)}_{\theta_{n-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ for the
functional representation of the compositional structure of the ANN, the inner-layer
outputs $h^{(i)}(x) = \big(f^{(i)}_{\theta_i}\circ
f^{(i-1)}_{\theta_{i-1}} \circ \dots \circ f^{(1)}_{\theta_1}\big)(x)$ define a
hierarchy of feature maps and kernels $k^{(i)}(x,x')=\exp(- \gamma_i
\|h^{(i)}(x)-h^{(i)}(x')\|_2^2)$. When combined with a batch of the dataset
these kernels produce KF losses $e_2^{(i)}$ (the $L^2$ regression error
incurred by using a random half of the batch to predict the other half)
depending on parameters of inner layers $\theta_1,\ldots,\theta_i$ (and
$\gamma_i$). The proposed method simply consists in aggregating a subset of
these KF losses with a classical output loss. We test the proposed method on
CNNs and WRNs without altering their structure or output classifier, and report
reduced test errors, decreased generalization gaps, and increased robustness to
distribution shift without significant increase in computational complexity. We
suspect that these results might be explained by the fact that while
conventional training only employs a linear functional (a generalized moment)
of the empirical distribution defined by the dataset and can be prone to
trapping in the Neural Tangent Kernel regime (under over-parameterizations),
the proposed loss function (defined as a nonlinear functional of the empirical
distribution) effectively trains the underlying kernel defined by the CNN
beyond regressing the data with that kernel.
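To make the construction concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' code) of how an inner-layer KF loss $e_2^{(i)}$ could be aggregated with a classical output loss: a toy two-block network exposes its intermediate feature maps $h^{(i)}(x)$, a Gaussian kernel $k^{(i)}$ is built from them, and a random half of the batch predicts the other half by kernel regression. The toy architecture, the $\gamma_i$ values, the KF-loss weights, and the jitter term are illustrative assumptions, not values taken from the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gaussian_kernel(h1, h2, gamma):
    # k(x, x') = exp(-gamma * ||h(x) - h(x')||_2^2) on batches of feature vectors.
    d2 = (h1.unsqueeze(1) - h2.unsqueeze(0)).pow(2).sum(-1)
    return torch.exp(-gamma * d2)


def kf_loss(h, y_onehot, gamma, jitter=1e-4):
    # e_2: L^2 error incurred when a random half of the batch predicts the
    # other half by kernel regression with the kernel induced by h.
    n = h.shape[0]
    perm = torch.randperm(n, device=h.device)
    idx_tr, idx_te = perm[: n // 2], perm[n // 2:]
    K_tt = gaussian_kernel(h[idx_tr], h[idx_tr], gamma)
    K_tt = K_tt + jitter * torch.eye(len(idx_tr), device=h.device)  # jitter for numerical stability
    K_st = gaussian_kernel(h[idx_te], h[idx_tr], gamma)
    pred = K_st @ torch.linalg.solve(K_tt, y_onehot[idx_tr])
    return (pred - y_onehot[idx_te]).pow(2).sum(dim=1).mean()


class TwoBlockNet(nn.Module):
    # Toy stand-in for f_theta = f^(2) o f^(1); CNN/WRN blocks would play the same role.
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        h1 = self.block1(x)   # h^(1)(x)
        h2 = self.block2(h1)  # h^(2)(x)
        return self.head(h2), (h1, h2)


model = TwoBlockNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
y_onehot = F.one_hot(y, num_classes=10).float()

logits, feats = model(x)
loss = F.cross_entropy(logits, y)                        # classical output loss
for h, gamma, w in zip(feats, (1.0, 1.0), (0.1, 0.1)):   # gamma_i and weights are illustrative
    loss = loss + w * kf_loss(h, y_onehot, gamma)        # add KF losses e_2^(i)
loss.backward()
opt.step()
```
In the paper's setting the feature maps come from CNN/WRN blocks rather than a toy MLP, and only a subset of the KF losses needs to be aggregated; the key point is that each $e_2^{(i)}$ depends on the inner-layer parameters $\theta_1,\ldots,\theta_i$ (and $\gamma_i$) and is trained jointly with the output loss.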
Related papers
- Enhanced Feature Learning via Regularisation: Integrating Neural Networks and Kernel Methods [0.0]
We introduce an efficient method for the estimator, called Brownian Kernel Neural Network (BKerNN)
We show that BKerNN's expected risk converges to the minimal risk with explicit high-probability rates of $O(\min((d/n)^{1/2}, n^{-1/6}))$ (up to logarithmic factors)
arXiv Detail & Related papers (2024-07-24T13:46:50Z) - Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the O(n) computational complexity and the O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay-bias calls.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
arXiv Detail & Related papers (2023-05-15T21:25:35Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Efficient Dataset Distillation Using Random Feature Approximation [109.07737733329019]
We propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel.
Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU.
Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets.
arXiv Detail & Related papers (2022-10-21T15:56:13Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work, and that under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Deformed semicircle law and concentration of nonlinear random matrices
for ultra-wide neural networks [29.03095282348978]
We study the limiting spectral distributions of two empirical kernel matrices associated with $f(X)$.
We show that random feature regression induced by the empirical kernel achieves the same performance as its limiting kernel regression under the ultra-wide regime.
arXiv Detail & Related papers (2021-09-20T05:25:52Z) - Beyond Lazy Training for Over-parameterized Tensor Decomposition [69.4699995828506]
We show that gradient descent on an over-parametrized objective can go beyond the lazy training regime and utilize certain low-rank structure in the data.
arXiv Detail & Related papers (2020-10-22T00:32:12Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study memorization and generalization phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by that of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - Learning Over-Parametrized Two-Layer ReLU Neural Networks beyond NTK [58.5766737343951]
We consider the dynamics of gradient descent for learning a two-layer neural network.
We show that an over-parametrized two-layer neural network trained with gradient descent can provably learn the ground truth beyond the Neural Tangent Kernel regime.
arXiv Detail & Related papers (2020-07-09T07:09:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.