Monarch: Expressive Structured Matrices for Efficient and Accurate
Training
- URL: http://arxiv.org/abs/2204.00595v1
- Date: Fri, 1 Apr 2022 17:37:29 GMT
- Title: Monarch: Expressive Structured Matrices for Efficient and Accurate
Training
- Authors: Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica
Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
- Abstract summary: Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
- Score: 64.6871423399431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large neural networks excel in many domains, but they are expensive to train
and fine-tune. A popular approach to reduce their compute or memory
requirements is to replace dense weight matrices with structured ones (e.g.,
sparse, low-rank, Fourier transform). These methods have not seen widespread
adoption (1) in end-to-end training due to unfavorable efficiency-quality
tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable
algorithms to approximate a given dense weight matrix. To address these issues,
we propose a class of matrices (Monarch) that is hardware-efficient (they are
parameterized as products of two block-diagonal matrices for better hardware
utilization) and expressive (they can represent many commonly used transforms).
Surprisingly, the problem of approximating a dense weight matrix with a Monarch
matrix, though nonconvex, has an analytical optimal solution. These properties
of Monarch matrices unlock new ways to train and fine-tune sparse and dense
models. We empirically validate that Monarch can achieve favorable
accuracy-efficiency tradeoffs in several end-to-end sparse training
applications: speeding up ViT and GPT-2 training on ImageNet classification and
Wikitext-103 language modeling by 2x with comparable model quality, and
reducing the error on PDE solving and MRI reconstruction tasks by 40%. In
sparse-to-dense training, with a simple technique called "reverse
sparsification," Monarch matrices serve as a useful intermediate representation
to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The
same technique brings 23% faster BERT pretraining than even the very optimized
implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse
fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds
up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.
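To make the parameterization concrete: a Monarch matrix is a product of two block-diagonal matrices interleaved with fixed permutations, commonly written M = P L P^T R with n = m*m, where L and R each have m blocks of size m x m and P is the permutation that transposes the m x m reshape of a length-n vector. Below is a minimal PyTorch sketch of the corresponding matrix-vector product; the function name monarch_matmul and the exact indexing convention are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def monarch_matmul(x, L_blocks, R_blocks):
    """Compute y = M x for a Monarch matrix M = P L P^T R.

    x        : (batch, n) inputs, with n = m * m
    L_blocks : (m, m, m) -- the m diagonal blocks of L, each m x m
    R_blocks : (m, m, m) -- the m diagonal blocks of R, each m x m
    """
    batch, n = x.shape
    m = L_blocks.shape[0]
    assert m * m == n, "this sketch assumes square blocks with n = m * m"

    # Block-diagonal R: block i acts on the i-th contiguous chunk of x.
    xb = x.reshape(batch, m, m)                     # (batch, block, within-block)
    u = torch.einsum('ikj,bij->bik', R_blocks, xb)

    # Permutation P^T: transpose the (block, within-block) axes.
    v = u.transpose(1, 2)

    # Block-diagonal L on the permuted layout.
    w = torch.einsum('ikj,bij->bik', L_blocks, v)

    # Permutation P: transpose back and flatten.
    return w.transpose(1, 2).reshape(batch, n)

# Cost: two batched (m x m) matmuls per input, i.e. O(n^1.5) FLOPs per vector
# instead of O(n^2) for a dense matrix-vector product.
x = torch.randn(4, 64 * 64)
L = torch.randn(64, 64, 64)
R = torch.randn(64, 64, 64)
y = monarch_matmul(x, L, R)   # shape (4, 4096)
```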
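The analytical dense-to-Monarch projection can be sketched under the same convention: each m x m slice of the suitably reshaped dense matrix is rank-1 in a Monarch matrix, so the Frobenius-optimal factors come from per-slice truncated SVDs. The helper below is a hedged reconstruction of that idea, not necessarily identical to the paper's algorithm.

```python
import torch

def project_to_monarch(A):
    """Approximate a dense (n, n) matrix A by a Monarch matrix M = P L P^T R.

    Returns (L_blocks, R_blocks), each of shape (m, m, m), usable with the
    monarch_matmul sketch above. For each pair (f, i) of block indices, the
    corresponding m x m slice of A is rank-1 in a Monarch matrix, so the
    optimum is that slice's best rank-1 approximation (leading singular triple).
    """
    n = A.shape[0]
    m = int(round(n ** 0.5))
    assert m * m == n

    # Reshape so A4[e, f, i, j] = A[e*m + f, i*m + j].
    A4 = A.reshape(m, m, m, m)
    L_blocks = A.new_zeros(m, m, m)
    R_blocks = A.new_zeros(m, m, m)
    for f in range(m):
        for i in range(m):
            slab = A4[:, f, i, :]                   # (m, m) slice over (e, j)
            U, S, Vh = torch.linalg.svd(slab)
            s = S[0].sqrt()
            L_blocks[f, :, i] = s * U[:, 0]         # column i of L_f
            R_blocks[i, f, :] = s * Vh[0, :]        # row f of R_i
    return L_blocks, R_blocks
```

A quick sanity check of the convention: projecting a matrix that was itself built from Monarch factors should recover it up to numerical error.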
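For "reverse sparsification" (train most of the way with Monarch weight matrices, then finish training dense), the hand-off step amounts to materializing each Monarch product as an explicit dense matrix. A minimal sketch using the same hypothetical helpers and indexing convention as above:

```python
import torch

def monarch_to_dense(L_blocks, R_blocks):
    """Materialize M = P L P^T R as an explicit (n, n) dense matrix,
    e.g. to initialize a dense layer that continues training from a
    Monarch-parameterized one ("reverse sparsification")."""
    m = L_blocks.shape[0]
    # M[e*m + f, i*m + j] = L_f[e, i] * R_i[f, j]
    M4 = torch.einsum('fei,ifj->efij', L_blocks, R_blocks)
    return M4.reshape(m * m, m * m)

# Consistency check against the matmul sketch above:
# monarch_matmul(x, L, R) should agree with x @ monarch_to_dense(L, R).T
# up to floating-point error.
```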
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
- Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0]
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices.
Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks.
arXiv Detail & Related papers (2024-09-27T15:48:39Z)
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models.
We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
- Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense layers for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z)
- Towards Higher Ranks via Adversarial Weight Pruning [34.602137305496335]
We propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner.
RPG outperforms the state of the art by 1.13% top-1 accuracy on ImageNet with ResNet-50 at 98% sparsity.
arXiv Detail & Related papers (2023-11-29T10:04:39Z)
- RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations [56.59168541623729]
Training graph neural networks (GNNs) is time-consuming because sparse graph-based operations are difficult to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which for the first time demonstrates the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z)
- Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models [24.92486575100738]
We show that Pixelated Butterfly is 3x faster than the butterfly baseline and speeds up training while achieving favorable accuracy-efficiency tradeoffs.
On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 medium baselines.
arXiv Detail & Related papers (2021-11-30T19:00:03Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for the semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- Exact Backpropagation in Binary Weighted Networks with Group Weight Transformations [0.0]
Quantization-based model compression serves as a high-performing and fast approach to inference.
Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product.
arXiv Detail & Related papers (2021-07-03T10:29:34Z)