Laughing Hyena Distillery: Extracting Compact Recurrences From
Convolutions
- URL: http://arxiv.org/abs/2310.18780v1
- Date: Sat, 28 Oct 2023 18:40:03 GMT
- Title: Laughing Hyena Distillery: Extracting Compact Recurrences From
Convolutions
- Authors: Stefano Massaroli, Michael Poli, Daniel Y. Fu, Hermann Kumbong, Rom N.
Parnichkun, Aman Timalsina, David W. Romero, Quinn McIntyre, Beidi Chen, Atri
Rudra, Ce Zhang, Christopher Re, Stefano Ermon, Yoshua Bengio
- Abstract summary: Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal O(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
- Score: 101.08706223326928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in attention-free sequence models rely on convolutions as
alternatives to the attention operator at the core of Transformers. In
particular, long convolution sequence models have achieved state-of-the-art
performance in many domains, but incur a significant cost during
auto-regressive inference workloads -- naively requiring a full pass (or
caching of activations) over the input sequence for each generated token --
similarly to attention-based models. In this paper, we seek to enable $\mathcal
O(1)$ compute and memory cost per token in any pre-trained long convolution
architecture to reduce memory footprint and increase throughput during
generation. Concretely, our methods consist in extracting low-dimensional
linear state-space models from each convolution layer, building upon rational
interpolation and model-order reduction techniques. We further introduce
architectural improvements to convolution-based layers such as Hyena: by
weight-tying the filters across channels into heads, we achieve higher
pre-training quality and reduce the number of filters to be distilled. The
resulting model achieves 10x higher throughput than Transformers and 1.5x
higher than Hyena at 1.3B parameters, without any loss in quality after
distillation.
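As a rough illustration of the distillation step described above, the sketch below fits a low-dimensional linear state-space model (SSM) to the impulse response of a long convolution filter and then evaluates it as a recurrence, giving constant compute and memory per generated token. It is only a minimal stand-in: it uses the classical Hankel-SVD realization (Kung's method) in place of the paper's rational-interpolation and model-order-reduction procedure, and the function names (`distill_ssm`, `ssm_generate`), the toy filter, and the chosen state dimension are illustrative assumptions rather than the authors' code.
```python
import numpy as np

def distill_ssm(h, d):
    """Fit a d-dimensional SSM (A, B, C, D) whose impulse response approximates
    the length-L filter h, via Hankel-matrix SVD (Ho-Kalman / Kung realization)."""
    L = len(h)
    D = h[0]                                  # direct feedthrough term
    p = (L - 1) // 2
    # Hankel matrix of the remaining Markov parameters h[1], h[2], ...
    H = np.array([[h[i + j + 1] for j in range(p)] for i in range(p)])
    U, S, Vt = np.linalg.svd(H)
    U, S, Vt = U[:, :d], S[:d], Vt[:d]        # truncate to order d
    obs = U * np.sqrt(S)                      # observability factor  O = U S^{1/2}
    ctr = np.sqrt(S)[:, None] * Vt            # controllability factor R = S^{1/2} V^T
    C = obs[0]                                # first row of O    -> C
    B = ctr[:, 0]                             # first column of R -> B
    A = np.linalg.pinv(obs[:-1]) @ obs[1:]    # shift invariance: O[1:] = O[:-1] A
    return A, B, C, D

def ssm_generate(A, B, C, D, u):
    """Recurrent evaluation: O(d^2) work and O(d) state per token,
    independent of how many tokens have been processed so far."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        ys.append(C @ x + D * u_t)            # y_t     = C x_t + D u_t
        x = A @ x + B * u_t                   # x_{t+1} = A x_t + B u_t
    return np.array(ys)

# Toy stand-in for a learned long filter: a sum of 4 damped cosines, so it is
# exactly realizable by an order-8 SSM and the distilled recurrence matches it.
L, d = 512, 8
t = np.arange(L)
decays = np.array([0.005, 0.01, 0.02, 0.03])
freqs = np.array([0.1, 0.3, 0.6, 1.0])
h = sum(np.exp(-a * t) * np.cos(w * t) for a, w in zip(decays, freqs))

rng = np.random.default_rng(0)
u = rng.standard_normal(256)                  # input sequence
A, B, C, D = distill_ssm(h, d)
y_conv = np.convolve(u, h)[: len(u)]          # full causal convolution: O(L) per token
y_ssm = ssm_generate(A, B, C, D, u)           # distilled recurrence:    O(1) per token
print("max abs error:", np.max(np.abs(y_conv - y_ssm)))
```
The toy filter is exactly realizable at the chosen order, so the recurrence reproduces the full convolution almost exactly; real pre-trained Hyena filters are only approximately low-order, which is where the paper's interpolation and order-reduction machinery comes in.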
Related papers
- CAT Pruning: Cluster-Aware Token Pruning For Text-to-Image Diffusion Models [5.406829638216823]
Diffusion models have revolutionized generative tasks, especially in the domain of text-to-image synthesis.
However, their iterative denoising process demands substantial computational resources.
We present a novel acceleration strategy that integrates token-level pruning with caching techniques to tackle this computational challenge.
arXiv Detail & Related papers (2025-02-01T13:46:02Z)
- Merging Feed-Forward Sublayers for Compressed Transformers [16.746335565636976]
We present a novel approach to model compression by merging similar parameter groups within a model.
Specifically, we select, align, and merge separate feed-forward sublayers in Transformer models.
We demonstrate performance comparable to the original models while merging more than a third of the model's feed-forward sublayers.
arXiv Detail & Related papers (2025-01-10T17:25:11Z)
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD.
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Sequence Modeling with Multiresolution Convolutional Memory [27.218134279968062]
We introduce a new building block for sequence modeling called the MultiresLayer.
The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence.
Our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks.
arXiv Detail & Related papers (2023-05-02T17:50:54Z)
- Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
arXiv Detail & Related papers (2023-01-09T17:32:00Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences of its use.