SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels
- URL: http://arxiv.org/abs/2305.09028v2
- Date: Sun, 9 Jul 2023 18:38:38 GMT
- Title: SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels
- Authors: Alexander Moreno, Jonathan Mei, Luke Walters
- Abstract summary: Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
They require O(n log n) computational complexity and O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay bias calls; we aim to reduce both.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
- Score: 69.47358238222586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toeplitz Neural Networks (TNNs) (Qin et al. 2023) are a recent sequence
model with impressive results. They require O(n log n) computational complexity
and O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and
decay bias calls. We aim to reduce both. We first note that the RPE is a
non-SPD (symmetric positive definite) kernel and the Toeplitz matrices are
pseudo-Gram matrices. Further 1) the learned kernels display spiky behavior
near the main diagonals with otherwise smooth behavior; 2) the RPE MLP is slow.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix
decomposition. For the sparse component's action, we do a small 1D convolution.
For the low rank component, we replace the RPE MLP with linear interpolation
and use asymmetric Structured Kernel Interpolation (SKI) (Wilson et al. 2015)
for O(n) complexity: we provide rigorous error analysis. For causal models,
"fast" causal masking (Katharopoulos et. al. 2020) negates SKI's benefits.
Working in the frequency domain, we avoid an explicit decay bias. To enforce
causality, we represent the kernel via the real part of its frequency response
using the RPE and compute the imaginary part via a Hilbert transform. This
maintains O(n log n) complexity but achieves an absolute speedup. Modeling the
frequency response directly is also competitive for bidirectional training,
using one fewer FFT. We set a speed state of the art on Long Range Arena (Tay et al. 2020) with minimal score degradation.
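To make the bidirectional construction above concrete, here is a minimal NumPy sketch of a sparse plus low-rank Toeplitz matrix-vector product: a SKI-style low-rank term built from linear interpolation onto a coarse grid of inducing positions, plus a banded correction for the spiky near-diagonal entries. The function names, the band width `band`, and the number of inducing points `m` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def linear_interp_weights(x, u):
    """Linear-interpolation matrix: row i carries the two hat-function weights
    mapping position x[i] onto the coarse grid u (kept dense here for brevity)."""
    m = len(u)
    W = np.zeros((len(x), m))
    idx = np.clip(np.searchsorted(u, x) - 1, 0, m - 2)
    frac = (x - u[idx]) / (u[idx + 1] - u[idx])
    W[np.arange(len(x)), idx] = 1.0 - frac
    W[np.arange(len(x)), idx + 1] = frac
    return W

def toeplitz_matvec_sparse_plus_lowrank(k_lag, x_vec, band=4, m=32):
    """Approximate y = T @ x_vec for T[i, j] = k_lag(i - j), with k_lag possibly
    asymmetric (non-SPD). The smooth part of T is handled by a SKI-style
    factorization W @ K_uu @ W.T; a sparse banded term corrects the spiky
    near-diagonal entries."""
    n = len(x_vec)
    pos = np.arange(n, dtype=float)
    u = np.linspace(0.0, n - 1.0, m)           # inducing positions (coarse grid)
    W = linear_interp_weights(pos, u)          # n x m, two nonzeros per row
    K_uu = k_lag(u[:, None] - u[None, :])      # m x m kernel on the coarse grid

    # Low-rank action: O(n + m^2) instead of the O(n^2) dense matvec.
    y = W @ (K_uu @ (W.T @ x_vec))

    # Banded correction: (true kernel - interpolant) on lags |i - j| <= band,
    # applied like a short 1D convolution. A production version would exploit
    # the two-nonzero structure of W instead of this dense einsum.
    for d in range(-band, band + 1):
        i = np.arange(max(0, d), min(n, n + d))             # rows with j = i - d valid
        j = i - d
        approx = np.einsum('ik,kl,il->i', W[i], K_uu, W[j]) # (W K_uu W.T)[i, j]
        y[i] += (k_lag(float(d)) - approx) * x_vec[j]
    return y
```

For instance, `k_lag = lambda d: np.exp(-np.abs(d) / 8.0) * (1.0 + 0.3 * np.tanh(d))` defines an asymmetric stationary kernel; the low-rank action then scales linearly in n for fixed m.

For the causal case, the sketch below illustrates the frequency-domain idea: given only the real part of the kernel's frequency response, causality determines the imaginary part through a discrete Hilbert-transform relation, after which the Toeplitz action is ordinary FFT convolution. It assumes a real kernel supported on lags 0..n-1 and a response sampled at 2n frequencies; again, a hedged illustration rather than the paper's code.

```python
import numpy as np

def causal_toeplitz_matvec_from_real_part(R, x_vec):
    """Given only the real part R of a length-2n frequency response (e.g. an RPE
    evaluated at 2n frequencies), recover the imaginary part consistent with a
    causal kernel supported on lags 0..n-1, then apply the causal Toeplitz
    operator with two more FFTs."""
    n = len(x_vec)
    N = len(R)
    assert N == 2 * n, "R is assumed to be sampled at 2n frequencies"

    # The inverse DFT of the real part is the circularly-even part of the kernel.
    even = np.fft.ifft(R).real

    # Fold it back into a causal kernel: lag 0 stays, lags 1..n-1 are doubled,
    # lags >= n are zero by assumption. This fixes Im(H) as the Hilbert
    # transform of R.
    t = np.zeros(N)
    t[0] = even[0]
    t[1:n] = 2.0 * even[1:n]

    H = np.fft.fft(t)                     # full causal frequency response
    X = np.fft.fft(x_vec, n=N)            # zero-padded input spectrum
    return np.fft.ifft(H * X).real[:n]    # O(n log n) causal Toeplitz matvec
```

The folding step (keeping lag 0 and doubling lags 1..n-1) is the time-domain counterpart of the Hilbert-transform relation between the real and imaginary parts of a causal sequence's spectrum.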
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove training loss minimization with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify the robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work and that, under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by one of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - Random Reshuffling: Simple Analysis with Vast Improvements [9.169947558498535]
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative descent steps in conjunction with data reshuffling.
arXiv Detail & Related papers (2020-06-10T17:57:21Z) - Deep regularization and direct training of the inner layers of Neural
Networks with Kernel Flows [0.609170287691728]
We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs).
KFs were introduced as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of points in random batches of the dataset.
arXiv Detail & Related papers (2020-02-19T18:20:36Z)