Laughing Hyena Distillery: Extracting Compact Recurrences From
Convolutions
- URL: http://arxiv.org/abs/2310.18780v1
- Date: Sat, 28 Oct 2023 18:40:03 GMT
- Title: Laughing Hyena Distillery: Extracting Compact Recurrences From
Convolutions
- Authors: Stefano Massaroli, Michael Poli, Daniel Y. Fu, Hermann Kumbong, Rom N.
Parnichkun, Aman Timalsina, David W. Romero, Quinn McIntyre, Beidi Chen, Atri
Rudra, Ce Zhang, Christopher Re, Stefano Ermon, Yoshua Bengio
- Abstract summary: Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal O(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
- Score: 101.08706223326928
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in attention-free sequence models rely on convolutions as
alternatives to the attention operator at the core of Transformers. In
particular, long convolution sequence models have achieved state-of-the-art
performance in many domains, but incur a significant cost during
auto-regressive inference workloads -- naively requiring a full pass (or
caching of activations) over the input sequence for each generated token --
similarly to attention-based models. In this paper, we seek to enable $\mathcal
O(1)$ compute and memory cost per token in any pre-trained long convolution
architecture to reduce memory footprint and increase throughput during
generation. Concretely, our methods consist in extracting low-dimensional
linear state-space models from each convolution layer, building upon rational
interpolation and model-order reduction techniques. We further introduce
architectural improvements to convolution-based layers such as Hyena: by
weight-tying the filters across channels into heads, we achieve higher
pre-training quality and reduce the number of filters to be distilled. The
resulting model achieves 10x higher throughput than Transformers and 1.5x
higher than Hyena at 1.3B parameters, without any loss in quality after
distillation.
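
To make the distillation idea concrete, here is a minimal, hedged sketch in NumPy. The paper extracts the state-space models via rational interpolation and model-order reduction; the code below instead uses a simple Prony-style least-squares modal fit on a synthetic filter, purely to illustrate the distill-then-recur pattern and the resulting $\mathcal O(1)$ per-token cost. The function names, the toy filter, and the fitting procedure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fit_modal_ssm(h, d=4):
    """Prony-style least-squares fit of h[t] ~ Re(sum_i c_i * lam_i**t).
    Stand-in for the paper's rational-interpolation / model-order-reduction step."""
    L = len(h)
    # Linear prediction: h[t] + a_1 h[t-1] + ... + a_d h[t-d] = 0 for t >= d.
    H = np.stack([h[d - k - 1:L - k - 1] for k in range(d)], axis=1)
    a, *_ = np.linalg.lstsq(H, -h[d:], rcond=None)
    lam = np.roots(np.concatenate(([1.0], a)))                 # poles -> diagonal state matrix
    V = lam[None, :] ** np.arange(L)[:, None]                  # Vandermonde in the poles
    c, *_ = np.linalg.lstsq(V, h.astype(complex), rcond=None)  # residues -> output projection
    return lam, c

def ssm_step(x, u, lam, c):
    """One O(d) recurrent step: x_t = diag(lam) x_{t-1} + u_t, y_t = Re(c^T x_t)."""
    x = lam * x + u
    return x, float(np.real(c @ x))

# Toy check: "distill" a synthetic smooth long filter, then generate recurrently.
rng = np.random.default_rng(0)
t = np.arange(512)
h = np.exp(-0.01 * t) * np.cos(0.05 * t) + 0.5 * np.exp(-0.03 * t) * np.cos(0.2 * t)
lam, c = fit_modal_ssm(h, d=4)

u = rng.standard_normal(512)
y_conv = np.convolve(u, h)[:512]          # what the convolution layer would compute
x, y_rec = np.zeros(len(lam), dtype=complex), []
for ut in u:                               # constant memory: no cache of past activations
    x, yt = ssm_step(x, ut, lam, c)
    y_rec.append(yt)
print(np.max(np.abs(y_conv - np.array(y_rec))))   # small when the modal fit is accurate
```

The recurrent loop touches only a d-dimensional state, so generation requires neither a pass over the full input sequence nor a cache of past activations.
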
Related papers
- LinFusion: 1 GPU, 1 Minute, 16K Image [71.44735417472043]
We introduce a low-rank approximation of a wide spectrum of popular linear token mixers.
We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original Stable Diffusion (SD).
Experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation.
arXiv Detail & Related papers (2024-09-03T17:54:39Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for approximating matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, the proposed estimators exhibit lower variance than existing ones; a hedged sketch of the baseline column-row sampling estimator appears after this list.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Sequence Modeling with Multiresolution Convolutional Memory [27.218134279968062]
We introduce a new building block for sequence modeling called the MultiresLayer.
The key component of our model is the multiresolution convolution, capturing multiscale trends in the input sequence.
Our model yields state-of-the-art performance on a number of sequence classification and autoregressive density estimation tasks.
arXiv Detail & Related papers (2023-05-02T17:50:54Z)
- Latent Autoregressive Source Separation [5.871054749661012]
This paper introduces vector-quantized Latent Autoregressive Source Separation (i.e., de-mixing an input signal into its constituent sources) without requiring additional gradient-based optimization or modifications of existing models.
Our separation method relies on the Bayesian formulation in which the autoregressive models are the priors, and a discrete (non-parametric) likelihood function is constructed by performing frequency counts over latent sums of addend tokens.
arXiv Detail & Related papers (2023-01-09T17:32:00Z)
- AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models [19.640997611256168]
We propose AlphaTuning, consisting of post-training quantization of the pre-trained language model and fine-tuning only a subset of the quantized parameters for a target task.
Specifically, AlphaTuning works by employing binary-coding quantization, which factorizes the full-precision parameters into binary parameters and a separate set of scaling factors; a hedged sketch of this factorization appears after this list.
We demonstrate that AlphaTuning, when applied to GPT-2 and OPT, performs competitively with full fine-tuning on a variety of downstream tasks while achieving >10x compression ratio under 4-bit quantization and >1,000x reduction in the number of trainable parameters.
arXiv Detail & Related papers (2022-10-08T00:36:00Z)
- Megapixel Image Generation with Step-Unrolled Denoising Autoencoders [5.145313322824774]
We propose a combination of techniques to push sample resolutions higher and reduce computational requirements for training and sampling.
These include the vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy but perceptually insignificant compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model.
Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly.
arXiv Detail & Related papers (2022-06-24T15:47:42Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart; a minimal sketch of attention run as a constant-state recurrence appears after this list.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
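
As referenced in the WTA-CRS entry above, here is a hedged sketch of the baseline unbiased column-row sampling (CRS) estimator for a matrix product, assuming the standard importance weights proportional to column/row norms. The winner-take-all refinement described in that paper is not reproduced, and `crs_matmul` is an illustrative name.

```python
import numpy as np

def crs_matmul(A, B, k, rng):
    """Unbiased column-row sampling (CRS) estimate of A @ B from k sampled index pairs."""
    # Variance-minimizing sampling weights for plain CRS: ||A[:, i]|| * ||B[i, :]||.
    w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = w / w.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    # E[A[:, i] B[i, :] / p_i] = sum_i A[:, i] B[i, :] = A @ B, so the mean of k draws is unbiased.
    return (A[:, idx] / p[idx]) @ B[idx, :] / k

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 1024)), rng.standard_normal((1024, 64))
approx = crs_matmul(A, B, k=256, rng=rng)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error falls as k grows
```

Because each sampled term is divided by its sampling probability, the estimator stays unbiased for any number of samples k; increasing k only reduces its variance.
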
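For the AlphaTuning entry, a hedged sketch of binary-coding quantization using a simple greedy per-row factorization; the paper's exact quantization scheme, grouping, and fine-tuning loop are not reproduced, and `greedy_bcq` / `dequantize` are illustrative names.

```python
import numpy as np

def greedy_bcq(W, num_bits=3):
    """Greedy binary-coding quantization: W ~ sum_k alphas[k][:, None] * codes[k],
    with codes in {-1, +1} and one scale per row per level (a simplification)."""
    residual = W.copy()
    alphas, codes = [], []
    for _ in range(num_bits):
        B = np.where(residual >= 0, 1.0, -1.0)    # binary code for this level
        alpha = np.abs(residual).mean(axis=1)     # per-row scale minimizing L2 error given B
        alphas.append(alpha)
        codes.append(B)
        residual = residual - alpha[:, None] * B  # quantize what is left
    return np.stack(alphas), np.stack(codes)      # shapes: (K, rows), (K, rows, cols)

def dequantize(alphas, codes):
    return np.einsum('kr,krc->rc', alphas, codes)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
alphas, codes = greedy_bcq(W, num_bits=3)
print(np.linalg.norm(W - dequantize(alphas, codes)) / np.linalg.norm(W))
# In the AlphaTuning setting described above, the binary codes stay frozen and only
# the scales (plus biases, etc.) would be trained for the downstream task.
```

Under such a factorization, fine-tuning only the scales leaves the frozen binary codes as the dominant storage cost, at roughly one bit per weight per level.
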
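For the "Finetuning Pretrained Transformers into RNNs" entry, a minimal sketch of causal linear attention run as a recurrence with a constant-size state. That work learns a small feature map during finetuning; the fixed `elu(x) + 1` map used here is a stand-in, and the class name is illustrative.

```python
import numpy as np

def feature_map(x):
    """Fixed positive feature map elu(x) + 1; the cited work learns a small MLP here instead."""
    return np.where(x > 0, x + 1.0, np.exp(x))

class LinearAttentionRNN:
    """Causal linear attention as an RNN: running sums replace the growing key/value cache,
    so every generated token costs O(d_k * d_v) compute and memory."""
    def __init__(self, d_k, d_v):
        self.S = np.zeros((d_k, d_v))   # sum over past positions of phi(k_i) v_i^T
        self.z = np.zeros(d_k)          # sum over past positions of phi(k_i)

    def step(self, q, k, v):
        phi_q, phi_k = feature_map(q), feature_map(k)
        self.S += np.outer(phi_k, v)
        self.z += phi_k
        return (phi_q @ self.S) / (phi_q @ self.z + 1e-6)

rng = np.random.default_rng(0)
cell = LinearAttentionRNN(d_k=16, d_v=16)
for _ in range(4):
    q, k, v = rng.standard_normal((3, 16))
    y = cell.step(q, k, v)              # output for the new token, no attention over the prefix
print(y.shape)
```

The running sums `S` and `z` summarize the entire prefix, so per-token cost and memory stay constant, which is the same property the Laughing Hyena distillation targets for convolutional layers.
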
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.