SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels
- URL: http://arxiv.org/abs/2305.09028v2
- Date: Sun, 9 Jul 2023 18:38:38 GMT
- Title: SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels
- Authors: Alexander Moreno, Jonathan Mei, Luke Walters
- Abstract summary: Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
They require O(n log n) computational complexity and O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay bias calls; we aim to reduce both.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
- Score: 69.47358238222586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Toeplitz Neural Networks (TNNs) (Qin et al. 2023) are a recent sequence
model with impressive results. They require O(n log n) computational complexity
and O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and
decay bias calls. We aim to reduce both. We first note that the RPE is a
non-SPD (symmetric positive definite) kernel and the Toeplitz matrices are
pseudo-Gram matrices. Further 1) the learned kernels display spiky behavior
near the main diagonals with otherwise smooth behavior; 2) the RPE MLP is slow.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix
decomposition. For the sparse component's action, we do a small 1D convolution.
For the low rank component, we replace the RPE MLP with linear interpolation
and use asymmetric Structured Kernel Interpolation (SKI) (Wilson et al. 2015)
for O(n) complexity: we provide rigorous error analysis. For causal models,
"fast" causal masking (Katharopoulos et. al. 2020) negates SKI's benefits.
Working in the frequency domain, we avoid an explicit decay bias. To enforce
causality, we represent the kernel via the real part of its frequency response
using the RPE and compute the imaginary part via a Hilbert transform. This
maintains O(n log n) complexity but achieves an absolute speedup. Modeling the
frequency response directly is also competitive for bidirectional training,
using one fewer FFT. We set a speed state of the art on Long Range Arena (Tay et al. 2020) with minimal score degradation.
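To make the bidirectional construction above concrete, here is a minimal NumPy sketch of a sparse plus low-rank Toeplitz matrix-vector product: a SKI-style low-rank term built from linear interpolation onto a coarse grid of inducing positions, plus a banded correction for the spiky near-diagonal entries. The function names, the band width `band`, and the number of inducing points `m` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def linear_interp_weights(x, u):
    """Linear-interpolation matrix: row i carries the two hat-function weights
    mapping position x[i] onto the coarse grid u (kept dense here for brevity)."""
    m = len(u)
    W = np.zeros((len(x), m))
    idx = np.clip(np.searchsorted(u, x) - 1, 0, m - 2)
    frac = (x - u[idx]) / (u[idx + 1] - u[idx])
    W[np.arange(len(x)), idx] = 1.0 - frac
    W[np.arange(len(x)), idx + 1] = frac
    return W

def toeplitz_matvec_sparse_plus_lowrank(k_lag, x_vec, band=4, m=32):
    """Approximate y = T @ x_vec for T[i, j] = k_lag(i - j), with k_lag possibly
    asymmetric (non-SPD). The smooth part of T is handled by a SKI-style
    factorization W @ K_uu @ W.T; a sparse banded term corrects the spiky
    near-diagonal entries."""
    n = len(x_vec)
    pos = np.arange(n, dtype=float)
    u = np.linspace(0.0, n - 1.0, m)           # inducing positions (coarse grid)
    W = linear_interp_weights(pos, u)          # n x m, two nonzeros per row
    K_uu = k_lag(u[:, None] - u[None, :])      # m x m kernel on the coarse grid

    # Low-rank action: O(n + m^2) instead of the O(n^2) dense matvec.
    y = W @ (K_uu @ (W.T @ x_vec))

    # Banded correction: (true kernel - interpolant) on lags |i - j| <= band,
    # applied like a short 1D convolution. A production version would exploit
    # the two-nonzero structure of W instead of this dense einsum.
    for d in range(-band, band + 1):
        i = np.arange(max(0, d), min(n, n + d))             # rows with j = i - d valid
        j = i - d
        approx = np.einsum('ik,kl,il->i', W[i], K_uu, W[j]) # (W K_uu W.T)[i, j]
        y[i] += (k_lag(float(d)) - approx) * x_vec[j]
    return y
```

For instance, `k_lag = lambda d: np.exp(-np.abs(d) / 8.0) * (1.0 + 0.3 * np.tanh(d))` defines an asymmetric stationary kernel; the low-rank action then scales linearly in n for fixed m.

For the causal case, the sketch below illustrates the frequency-domain idea: given only the real part of the kernel's frequency response, causality determines the imaginary part through a discrete Hilbert-transform relation, after which the Toeplitz action is ordinary FFT convolution. It assumes a real kernel supported on lags 0..n-1 and a response sampled at 2n frequencies; again, a hedged illustration rather than the paper's code.

```python
import numpy as np

def causal_toeplitz_matvec_from_real_part(R, x_vec):
    """Given only the real part R of a length-2n frequency response (e.g. an RPE
    evaluated at 2n frequencies), recover the imaginary part consistent with a
    causal kernel supported on lags 0..n-1, then apply the causal Toeplitz
    operator with two more FFTs."""
    n = len(x_vec)
    N = len(R)
    assert N == 2 * n, "R is assumed to be sampled at 2n frequencies"

    # The inverse DFT of the real part is the circularly-even part of the kernel.
    even = np.fft.ifft(R).real

    # Fold it back into a causal kernel: lag 0 stays, lags 1..n-1 are doubled,
    # lags >= n are zero by assumption. This fixes Im(H) as the Hilbert
    # transform of R.
    t = np.zeros(N)
    t[0] = even[0]
    t[1:n] = 2.0 * even[1:n]

    H = np.fft.fft(t)                     # full causal frequency response
    X = np.fft.fft(x_vec, n=N)            # zero-padded input spectrum
    return np.fft.ifft(H * X).real[:n]    # O(n log n) causal Toeplitz matvec
```

The folding step (keeping lag 0 and doubling lags 1..n-1) is the time-domain counterpart of the Hilbert-transform relation between the real and imaginary parts of a causal sequence's spectrum.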
Related papers
- Stable Minima Cannot Overfit in Univariate ReLU Networks: Generalization by Large Step Sizes [29.466981306355066]
We show that gradient descent with a fixed learning rate $\eta$ can only find local minima that represent smooth functions.
We also prove a nearly-optimal MSE bound of $\widetilde{O}(n^{-4/5})$ within the strict interior of the support of the $n$ data points.
arXiv Detail & Related papers (2024-06-10T22:57:27Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Learning with Norm Constrained, Over-parameterized, Two-layer Neural Networks [54.177130905659155]
Recent studies show that a reproducing kernel Hilbert space (RKHS) is not a suitable space to model functions by neural networks.
In this paper, we study a suitable function space for over-parameterized two-layer neural networks with bounded norms.
arXiv Detail & Related papers (2024-04-29T15:04:07Z) - Generalization and Stability of Interpolating Neural Networks with
Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove training loss minimization with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Improved techniques for deterministic l2 robustness [63.34032156196848]
Training convolutional neural networks (CNNs) with a strict 1-Lipschitz constraint under the $l_2$ norm is useful for adversarial robustness, interpretable gradients and stable training.
We introduce a procedure to certify the robustness of 1-Lipschitz CNNs by replacing the last linear layer with a 1-hidden-layer MLP.
We significantly advance the state-of-the-art for standard and provable robust accuracies on CIFAR-10 and CIFAR-100.
arXiv Detail & Related papers (2022-11-15T19:10:12Z) - Bounding the Width of Neural Networks via Coupled Initialization -- A
Worst Case Analysis [121.9821494461427]
We show how to significantly reduce the number of neurons required for two-layer ReLU networks.
We also prove new lower bounds that improve upon prior work and that, under certain assumptions, are best possible.
arXiv Detail & Related papers (2022-06-26T06:51:31Z) - Paraformer: Fast and Accurate Parallel Transformer for
Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z) - The Interpolation Phase Transition in Neural Networks: Memorization and
Generalization under Lazy Training [10.72393527290646]
We study these phenomena in the context of two-layer neural networks in the neural tangent (NT) regime.
We prove that as soon as $Nd \gg n$, the test error is well approximated by one of kernel ridge regression with respect to the infinite-width kernel.
The latter is in turn well approximated by the error of ridge regression, whereby the regularization parameter is increased by a 'self-induced' term related to the high-degree components of the activation function.
arXiv Detail & Related papers (2020-07-25T01:51:13Z) - Random Reshuffling: Simple Analysis with Vast Improvements [9.169947558498535]
Random Reshuffling (RR) is an algorithm for minimizing finite-sum functions that utilizes iterative descent steps in conjunction with data reshuffling.
arXiv Detail & Related papers (2020-06-10T17:57:21Z) - Deep regularization and direct training of the inner layers of Neural
Networks with Kernel Flows [0.609170287691728]
We introduce a new regularization method for Artificial Neural Networks (ANNs) based on Kernel Flows (KFs).
KFs were introduced as a method for kernel selection in regression/kriging based on the minimization of the loss of accuracy incurred by halving the number of points in random batches of the dataset.
arXiv Detail & Related papers (2020-02-19T18:20:36Z)