Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
- URL: http://arxiv.org/abs/2410.12982v1
- Date: Wed, 16 Oct 2024 19:23:46 GMT
- Title: Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond
- Authors: Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade
- Abstract summary: We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2 L)$ time.
We provide a proof of concept implementation for Hyena, which gets up to $1.6\times$ end-to-end improvement over standard inference.
- Score: 7.280765035096294
- Abstract: While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $1.6\times$ end-to-end improvement over standard inference by improving $50\times$ within the position-mixing part.
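To make the mechanism concrete, below is a minimal numpy sketch of the relaxed (online) convolution idea the paper builds on: each output of a causal long convolution is finalized as soon as the corresponding input arrives, while power-of-two input blocks are convolved with matching filter blocks via FFT and scattered into future outputs. All function names are ours, and the sketch omits the paper's tiling for memory movement and its cross-layer parallelization; it only illustrates where the $O(L\log^2 L)$ bound comes from.

```python
import numpy as np

def fft_conv(a, b):
    # Linear convolution of two short blocks via FFT.
    n = len(a) + len(b) - 1
    nfft = max(2, 1 << (n - 1).bit_length())
    return np.fft.irfft(np.fft.rfft(a, nfft) * np.fft.rfft(b, nfft), nfft)[:n]

def online_causal_conv(x, h):
    """Causal convolution y[t] = sum_{j<=t} h[j] * x[t-j], finalized online:
    y[t] is complete as soon as x[t] is read, so in autoregressive use x[t+1]
    may depend on y[t]. Whenever an aligned input block x[t+1-m .. t] completes
    (one for every power of two m dividing t+1), it is convolved with the
    filter block h[m .. 2m-1] and scattered into not-yet-emitted outputs.
    Each (input, filter-tap) pair is covered exactly once, giving
    O(L log^2 L) total work instead of O(L^2)."""
    L = len(x)
    y = np.zeros(L)
    for t in range(L):
        y[t] += h[0] * x[t]              # j = 0 term: y[t] is now final
        m = 1
        while (t + 1) % m == 0:          # all block sizes ending at t
            hb = h[m : 2 * m]
            if len(hb) and t + 1 < L:
                c = fft_conv(x[t + 1 - m : t + 1], hb)
                end = min(t + 1 + len(c), L)
                y[t + 1 : end] += c[: end - (t + 1)]
            m *= 2
    return y

# Sanity check against the O(L^2) reference implementation
x, h = np.random.randn(1000), np.random.randn(1000)
assert np.allclose(online_causal_conv(x, h), np.convolve(x, h)[:1000])
```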
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Sample-efficient Learning of Infinite-horizon Average-reward MDPs with General Function Approximation [53.17668583030862]
We study infinite-horizon average-reward Markov decision processes (AMDPs) in the context of general function approximation.
We propose a novel algorithmic framework named Local-fitted Optimization with OPtimism (LOOP).
We show that LOOP achieves a sublinear $\tilde{\mathcal{O}}(\mathrm{poly}(d, \mathrm{sp}(V^*))\sqrt{T\beta})$ regret, where $d$ and $\beta$ correspond to AGEC and the log-covering number of the hypothesis class, respectively.
arXiv Detail & Related papers (2024-04-19T06:24:22Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
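The two-stage pattern is easy to sketch. The snippet below uses a random low-rank sketch as the cheap scorer, which is only a stand-in: HiRE's actual compression scheme and its DA-TOP-$k$ operator are not reproduced here, and all shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V, r, k = 256, 8000, 32, 16                 # hidden dim, vocab, sketch rank, top-k
W = rng.standard_normal((d, V)) / np.sqrt(d)   # e.g. an output projection
x = rng.standard_normal(d)

# Stage 1: cheap approximate scores from a precomputed low-rank sketch of W.
# Costs O(rd + rV) instead of the exact O(dV).
S = rng.standard_normal((r, d)) / np.sqrt(r)
W_sketch = S @ W                               # computed once, offline
approx_scores = (S @ x) @ W_sketch             # shape (V,)

# Over-fetch k' > k candidates so the true top-k survives with high recall.
k_prime = 4 * k
cand = np.argpartition(approx_scores, -k_prime)[-k_prime:]

# Stage 2: exact computation restricted to the predicted subset, O(d * k').
exact = x @ W[:, cand]
topk = cand[np.argsort(exact)[-k:][::-1]]
```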
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features.
arXiv Detail & Related papers (2023-11-27T12:59:52Z) - Tractable Bounding of Counterfactual Queries by Knowledge Compilation [51.47174989680976]
We discuss the problem of bounding partially identifiable queries, such as counterfactuals, in Pearlian structural causal models.
A recently proposed iterated EM scheme yields an inner approximation of those bounds by sampling the initialisation parameters.
We show how a single symbolic knowledge compilation allows us to obtain the circuit structure with symbolic parameters, which are then replaced by their actual values.
arXiv Detail & Related papers (2023-10-05T07:10:40Z) - DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of Transformer architectures arises from the computation of the dot-product attention.
Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module.
An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
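As a rough illustration of how a lossy transform can shrink attention, here is one way to truncate keys and values to their leading DCT coefficients along the sequence axis; DCT-Former's exact construction differs, and the function name below is ours.

```python
import numpy as np
from scipy.fft import dct

def dct_compressed_attention(Q, K, V, keep):
    # Keep only the first `keep` DCT coefficients of K and V along the
    # sequence axis (JPEG-style truncation), so the score matrix shrinks
    # from (L, L) to (L, keep).
    Kc = dct(K, axis=0, norm="ortho")[:keep]
    Vc = dct(V, axis=0, norm="ortho")[:keep]
    scores = Q @ Kc.T / np.sqrt(Q.shape[-1])           # (L, keep)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # row-wise softmax
    return w @ Vc                                      # (L, d)

L, d, keep = 512, 64, 96
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out = dct_compressed_attention(Q, K, V, keep)          # (512, 64)
```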
arXiv Detail & Related papers (2022-03-02T15:25:27Z) - Fast 2D Convolutions and Cross-Correlations Using Scalable Architectures [2.2940141855172027]
The basic idea is to map 2D convolutions and cross-correlations to a collection of 1D convolutions and cross-correlations in the transform domain.
The approach uses scalable architectures that can be fitted into modern FPGA and Zynq-SOC devices.
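The decomposition itself is classical and easy to verify in software: the 2-D DFT separates into a bank of 1-D row transforms followed by a bank of 1-D column transforms, each of which maps naturally onto a streaming hardware pipeline. A numpy sketch (our naming; the paper's FPGA architectures are not reproduced here):

```python
import numpy as np
from scipy.signal import convolve2d

def conv2d_via_1d(A, B):
    # Full 2-D linear convolution built from 1-D FFT stages only.
    m = A.shape[0] + B.shape[0] - 1
    n = A.shape[1] + B.shape[1] - 1
    def dft2(X):
        Xp = np.zeros((m, n), dtype=complex)
        Xp[: X.shape[0], : X.shape[1]] = X       # zero-pad to full output size
        Xp = np.fft.fft(Xp, axis=1)              # bank of 1-D row FFTs
        return np.fft.fft(Xp, axis=0)            # bank of 1-D column FFTs
    Y = np.fft.ifft(np.fft.ifft(dft2(A) * dft2(B), axis=0), axis=1)
    return Y.real

A, B = np.random.randn(16, 16), np.random.randn(5, 5)
assert np.allclose(conv2d_via_1d(A, B), convolve2d(A, B))
```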
arXiv Detail & Related papers (2021-12-24T22:34:51Z) - The connection between time-local and time-nonlocal perturbation expansions [0.0]
We show that a series for the kernel $\mathcal{K}$ can be translated directly into a corresponding series for the more complicated generator $\mathcal{G}$.
We illustrate this for leading and next-to-leading order calculations of $\mathcal{K}$ and $\mathcal{G}$ for the single impurity Anderson model.
arXiv Detail & Related papers (2021-07-19T15:05:29Z) - Distributed stochastic optimization with large delays [59.95552973784946]
One of the most widely used methods for solving large-scale optimization problems is distributed asynchronous stochastic gradient descent (DASGD).
We show that DASGD converges to a global optimum under the same delay assumptions.
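A toy, single-process simulation of the update rule studied (gradients applied at stale iterates with random delays); the distributed machinery, stochastic gradients, and the paper's analysis are all omitted, and every name here is ours:

```python
import numpy as np

def dasgd_sim(grad, x0, steps, max_delay, rng):
    # x_{t+1} = x_t - eta_t * grad(x_{t - d_t}), with random delay d_t.
    xs = [np.asarray(x0, dtype=float)]
    for t in range(steps):
        d = int(rng.integers(0, min(t, max_delay) + 1))  # staleness d_t <= t
        g = grad(xs[t - d])                              # gradient at old iterate
        eta = 0.1 / np.sqrt(t + 1)                       # vanishing step size
        xs.append(xs[-1] - eta * g)
    return xs[-1]

rng = np.random.default_rng(0)
# Toy objective f(x) = 0.5 ||x||^2, so grad(x) = x and the optimum is 0.
x_final = dasgd_sim(lambda x: x, np.ones(5), steps=2000, max_delay=50, rng=rng)
print(np.linalg.norm(x_final))   # small: converges despite large delays
```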
arXiv Detail & Related papers (2021-07-06T21:59:49Z) - Reusing Combinatorial Structure: Faster Iterative Projections over Submodular Base Polytopes [7.734726150561089]
We develop a toolkit to speed up the computation of projections using both discrete and continuous perspectives.
For the special case of cardinality-based submodular polytopes, we improve the runtime of computing certain Bregman projections by a factor of $\Omega(n/\log n)$.
arXiv Detail & Related papers (2021-06-22T17:29:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content above (including all information) and is not responsible for any consequences arising from its use.