Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
- URL: http://arxiv.org/abs/2112.00029v1
- Date: Tue, 30 Nov 2021 19:00:03 GMT
- Title: Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models
- Authors: Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri
Rudra, Christopher Re
- Abstract summary: We show that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy-efficiency tradeoffs.
On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium.
- Score: 24.92486575100738
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Overparameterized neural networks generalize well but are expensive to train.
Ideally, one would like to reduce their computational cost while retaining
their generalization benefits. Sparse model training is a simple and promising
approach to achieve this, but there remain challenges as existing methods
struggle with accuracy loss, slow training runtime, or difficulty in
sparsifying all model components. The core problem is that searching for a
sparsity mask over a discrete set of sparse matrices is difficult and
expensive. To address this, our main insight is to optimize over a continuous
superset of sparse matrices with a fixed structure known as products of
butterfly matrices. As butterfly matrices are not hardware efficient, we
propose simple variants of butterfly (block and flat) to take advantage of
modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity
pattern based on flat block butterfly and low-rank matrices to sparsify most
network layers (e.g., attention, MLP). We empirically validate that Pixelated
Butterfly is 3x faster than butterfly and speeds up training to achieve
favorable accuracy-efficiency tradeoffs. On the ImageNet classification and
WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster
than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium with no drop in
accuracy.
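As a concrete illustration of the approach described in the abstract, the sketch below shows a dense-layer replacement built from a fixed block-sparse "flat butterfly" mask plus a low-rank term. This is a minimal PyTorch sketch, not the authors' released implementation: names such as `flat_butterfly_mask` and `PixelatedButterflyLinear` are illustrative, and the exact mask layout and initialization in the paper may differ.

```python
# Minimal sketch (not the authors' released code) of the Pixelated Butterfly idea:
# a linear layer whose weight is a fixed block-sparse "flat butterfly" pattern
# plus a low-rank term. All names here are illustrative assumptions.
import torch
import torch.nn as nn


def flat_butterfly_mask(n: int, block: int) -> torch.Tensor:
    """Boolean support of a flattened block butterfly: block-diagonal blocks
    plus partner blocks at power-of-two distances (the union of the supports
    of the individual butterfly factors)."""
    assert n % block == 0
    nb = n // block
    assert nb & (nb - 1) == 0, "number of blocks assumed to be a power of two"
    mask = torch.eye(nb, dtype=torch.bool)          # block diagonal
    stride = 1
    while stride < nb:
        for i in range(nb):
            mask[i, i ^ stride] = True               # butterfly partner block
        stride *= 2
    # Expand the block-level mask to element level.
    return mask.repeat_interleave(block, 0).repeat_interleave(block, 1)


class PixelatedButterflyLinear(nn.Module):
    """Dense-layer replacement: fixed block-sparse weight plus a low-rank term."""

    def __init__(self, n: int, block: int = 32, rank: int = 8):
        super().__init__()
        self.register_buffer("mask", flat_butterfly_mask(n, block).float())
        self.sparse_weight = nn.Parameter(0.02 * torch.randn(n, n))
        self.lowrank_u = nn.Parameter(0.02 * torch.randn(n, rank))
        self.lowrank_v = nn.Parameter(0.02 * torch.randn(rank, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The mask is fixed up front, so there is no discrete search over
        # sparsity patterns during training.
        w = self.sparse_weight * self.mask
        return x @ w.t() + (x @ self.lowrank_u) @ self.lowrank_v


if __name__ == "__main__":
    layer = PixelatedButterflyLinear(n=256, block=32, rank=8)
    print(layer(torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```

For clarity the sparse part is materialized here as a dense masked matrix; the training speedups quoted in the abstract come from running the fixed block-sparse pattern with block-sparse GPU kernels rather than dense masking.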
Related papers
- Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0]
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices.
Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks.
arXiv Detail & Related papers (2024-09-27T15:48:39Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block Tensor-Train (BTT), which we show performs better than dense matrices for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- Lite it fly: An All-Deformable-Butterfly Network [7.8460795568982435]
Most deep neural networks (DNNs) consist fundamentally of convolutional and/or fully connected layers.
The recently proposed deformable butterfly (DeBut) decomposes the filter matrix into generalized, butterfly-like factors.
This work reveals an intimate link between DeBut and a systematic hierarchy of depthwise and pointwise convolutions.
arXiv Detail & Related papers (2023-11-14T12:41:22Z)
- ButterflyFlow: Building Invertible Layers with Butterfly Matrices [80.83142511616262]
We propose a new family of invertible linear layers based on butterfly layers.
Based on our invertible butterfly layers, we construct a new class of normalizing flow models called ButterflyFlow.
arXiv Detail & Related papers (2022-09-28T01:58:18Z)
- Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z)
- Monarch: Expressive Structured Matrices for Efficient and Accurate Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z)
- A Fast Post-Training Pruning Framework for Transformers [74.59556951906468]
Pruning is an effective way to reduce the huge inference cost of large Transformer models.
Prior work on model pruning requires retraining the model.
We propose a fast post-training pruning framework for Transformers that does not require any retraining.
arXiv Detail & Related papers (2022-03-29T07:41:11Z)
- Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice [4.3400407844814985]
We propose to replace a dense linear layer in any neural network by an architecture based on the butterfly network.
In a collection of experiments, including supervised prediction on both NLP and vision data, we show that this produces results that not only match but at times outperform existing well-known architectures (a minimal sketch of this butterfly factorization follows the list).
arXiv Detail & Related papers (2020-07-17T09:45:03Z)
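Both the abstract above ("products of butterfly matrices") and the last related paper build on the classical butterfly factorization. The sketch below uses illustrative function names, not code from any of these papers: it shows how an n x n transform, for n a power of two, factors into log2(n) sparse matrices with two nonzeros per row, which is why a butterfly replacement of a dense layer costs O(n log n) per matrix-vector product instead of O(n^2).

```python
# Illustrative sketch of the classical butterfly factorization (function names
# are assumptions, not from any released implementation).
import numpy as np


def random_butterfly_factor(n: int, stride: int) -> np.ndarray:
    """One factor: an independent 2x2 mixing of every index pair (i, i + stride)."""
    factor = np.zeros((n, n))
    for start in range(0, n, 2 * stride):
        for i in range(start, start + stride):
            a, b, c, d = np.random.randn(4)
            factor[i, i], factor[i, i + stride] = a, b
            factor[i + stride, i], factor[i + stride, i + stride] = c, d
    return factor


def random_butterfly_matrix(n: int) -> np.ndarray:
    """Product of log2(n) factors: dense overall, but applying the factors one
    by one costs O(n log n) rather than the O(n^2) of a dense layer."""
    assert n & (n - 1) == 0, "n assumed to be a power of two"
    result = np.eye(n)
    stride = 1
    while stride < n:
        result = random_butterfly_factor(n, stride) @ result
        stride *= 2
    return result


print(np.count_nonzero(random_butterfly_factor(8, 1)))  # 16 nonzeros = 2 * n
print(random_butterfly_matrix(8).shape)                 # (8, 8)
```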
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.