Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
- URL: http://arxiv.org/abs/2408.10189v1
- Date: Mon, 19 Aug 2024 17:48:11 GMT
- Title: Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
- Authors: Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, Albert Gu
- Abstract summary: We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, distills a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens, and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba achieves substantially stronger performance than all past open-source non-Transformer models.
- Score: 92.36510016591782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures have become a dominant paradigm for domains like language modeling but suffer in many inference settings due to their quadratic-time self-attention. Recently proposed subquadratic architectures, such as Mamba, have shown promise, but have been pretrained with substantially less computational resources than the strongest Transformer models. In this work, we present a method that is able to distill a pretrained Transformer architecture into alternative architectures such as state space models (SSMs). The key idea of our approach is that we can view both Transformers and SSMs as applying different forms of mixing matrices over the token sequences. We can thus progressively distill the Transformer architecture by matching different degrees of granularity in the SSM: first matching the mixing matrices themselves, then the hidden units at each block, and finally the end-to-end predictions. Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture (Phi-Mamba) using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens. Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models. MOHAWK allows models like SSMs to leverage computational resources invested in training Transformer-based architectures, highlighting a new avenue for building such models.
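The three levels of matching described in the abstract (mixing matrices, per-block hidden states, end-to-end predictions) can be illustrated with a minimal PyTorch-style sketch. The function and tensor names below are placeholders chosen for illustration and do not reflect the paper's actual code or API.
```python
# Minimal sketch of the three matching objectives described in the abstract.
# Tensor/function names are illustrative placeholders, not the paper's API.
import torch
import torch.nn.functional as F

def mixing_matrix_loss(teacher_attn: torch.Tensor, student_mix: torch.Tensor) -> torch.Tensor:
    """Stage 1: align the student SSM's materialized mixing matrix with the
    teacher's attention matrix (Frobenius norm, averaged over batch/heads)."""
    return torch.linalg.matrix_norm(teacher_attn - student_mix).mean()

def hidden_state_loss(teacher_hidden: torch.Tensor, student_hidden: torch.Tensor) -> torch.Tensor:
    """Stage 2: match each block's output hidden states given the same input."""
    return F.mse_loss(student_hidden, teacher_hidden)

def prediction_loss(teacher_logits: torch.Tensor, student_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """Stage 3: distill the end-to-end predictions with a KL objective on logits."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```
These objectives correspond, in order, to the progressively coarser levels of matching the abstract describes: individual mixing matrices, block outputs, and the full model's predictions.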
Related papers
- MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation [63.73137438677585]
MaskMamba is a novel hybrid model that combines Mamba and Transformer architectures.
It achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048 \times 2048$ over the Transformer.
arXiv Detail & Related papers (2024-09-30T04:28:55Z) - The Mamba in the Llama: Distilling and Accelerating Hybrid Models [76.64055251296548]
We show that it is feasible to distill large Transformers into linear RNNs using academic GPU resources by reusing the linear projection weights from attention layers (a minimal sketch of this weight reuse appears after the related-papers list).
The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks.
arXiv Detail & Related papers (2024-08-27T17:56:11Z) - An Empirical Study of Mamba-based Language Models [69.74383762508805]
Selective state-space models (SSMs) like Mamba overcome some shortcomings of Transformers.
We present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets.
We find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks.
arXiv Detail & Related papers (2024-06-12T05:25:15Z) - Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality [31.985243136674146]
State-space models (SSMs) such as Mamba have been shown to match or outperform Transformers at small to medium scale.
Our state space duality (SSD) framework allows us to design a new architecture (Mamba-2) whose core layer is a refinement of Mamba's selective SSM that is 2-8x faster.
arXiv Detail & Related papers (2024-05-31T17:50:01Z) - Mamba: Linear-Time Sequence Modeling with Selective State Spaces [31.985243136674146]
Foundation models are almost universally based on the Transformer architecture and its core attention module.
We identify that a key weakness of such models is their inability to perform content-based reasoning, and address it by making the SSM parameters functions of the input (a selection mechanism).
We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba).
As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics.
arXiv Detail & Related papers (2023-12-01T18:01:34Z) - Converting Transformers to Polynomial Form for Secure Inference Over Homomorphic Encryption [45.00129952368691]
Homomorphic Encryption (HE) has emerged as one of the most promising approaches for privacy-preserving deep learning.
We introduce the first polynomial transformer, providing the first demonstration of secure inference over HE with transformers.
Our models yield results comparable to traditional methods, bridging the performance gap with transformers of similar scale and underscoring the viability of HE for state-of-the-art applications.
arXiv Detail & Related papers (2023-11-15T00:23:58Z) - Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture [31.763186154430347]
We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension.
As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style classification, and causal GPT-style language modeling.
For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in GLUE quality with up to 27% fewer parameters, and up to 9.1$\times$ higher throughput at sequence length 4K.
arXiv Detail & Related papers (2023-10-18T17:06:22Z) - MatFormer: Nested Transformer for Elastic Inference [94.1789252941718]
MatFormer is a nested Transformer architecture designed to offer elasticity in a variety of deployment constraints.
We show that a 2.6B decoder-only MatFormer language model (MatLM) allows us to extract smaller models spanning from 1.5B to 2.6B.
We also observe that smaller encoders extracted from a universal MatFormer-based ViT (MatViT) encoder preserve the metric-space structure for adaptive large-scale retrieval.
arXiv Detail & Related papers (2023-10-11T17:57:14Z) - Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
The proposed MHS model consists of two components: a stack of transformer layers used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages the individual models in the ensemble as teachers in a way that preserves the diversity of the ensemble members.
arXiv Detail & Related papers (2022-01-15T06:21:01Z)
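For the attention-to-linear-RNN distillation summarized in "The Mamba in the Llama" entry above, the weight reuse can be sketched as a linear-attention-style recurrence whose Q/K/V/output projections are copied from a pretrained attention layer. The attribute names (q_proj, k_proj, v_proj, o_proj) and the specific unnormalized recurrence are assumptions made for illustration, not that paper's actual architecture.
```python
# Illustrative sketch: initialize a linear-RNN token mixer by reusing the
# projection weights of a pretrained attention layer. GPT-style attribute
# names (q_proj/k_proj/v_proj/o_proj) are assumed, not guaranteed.
import torch
import torch.nn as nn

class LinearRNNMixer(nn.Module):
    def __init__(self, attn_layer: nn.Module, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Copy the teacher attention layer's projections as the student's init.
        for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
            getattr(self, name).weight.data.copy_(getattr(attn_layer, name).weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). The recurrent state is O(d_model^2),
        # so generation is linear in sequence length instead of quadratic.
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        state = x.new_zeros(x.size(0), x.size(-1), x.size(-1))
        outputs = []
        for t in range(x.size(1)):
            # Outer-product state update (unnormalized linear attention).
            state = state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(1)
            outputs.append(torch.einsum("bd,bde->be", q[:, t], state))
        return self.o_proj(torch.stack(outputs, dim=1))
```
In practice the distilled hybrid models described above also retain a fraction of the original attention layers; the sketch only shows the projection-weight reuse itself.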
This list is automatically generated from the titles and abstracts of the papers on this site.