Monarch: Expressive Structured Matrices for Efficient and Accurate
Training
- URL: http://arxiv.org/abs/2204.00595v1
- Date: Fri, 1 Apr 2022 17:37:29 GMT
- Title: Monarch: Expressive Structured Matrices for Efficient and Accurate
Training
- Authors: Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica
Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré
- Abstract summary: Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
- Score: 64.6871423399431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large neural networks excel in many domains, but they are expensive to train
and fine-tune. A popular approach to reduce their compute or memory
requirements is to replace dense weight matrices with structured ones (e.g.,
sparse, low-rank, Fourier transform). These methods have not seen widespread
adoption (1) in end-to-end training due to unfavorable efficiency--quality
tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable
algorithms to approximate a given dense weight matrix. To address these issues,
we propose a class of matrices (Monarch) that is hardware-efficient (they are
parameterized as products of two block-diagonal matrices for better hardware
utilization) and expressive (they can represent many commonly used transforms).
Surprisingly, the problem of approximating a dense weight matrix with a Monarch
matrix, though nonconvex, has an analytical optimal solution. These properties
of Monarch matrices unlock new ways to train and fine-tune sparse and dense
models. We empirically validate that Monarch can achieve favorable
accuracy-efficiency tradeoffs in several end-to-end sparse training
applications: speeding up ViT and GPT-2 training on ImageNet classification and
Wikitext-103 language modeling by 2x with comparable model quality, and
reducing the error on PDE solving and MRI reconstruction tasks by 40%. In
sparse-to-dense training, with a simple technique called "reverse
sparsification," Monarch matrices serve as a useful intermediate representation
to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The
same technique brings 23% faster BERT pretraining than even the very optimized
implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse
fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds
up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.
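The abstract's core idea is that a Monarch matrix is parameterized as a product of two block-diagonal matrices interleaved with fixed permutations, so a matrix-vector product costs far less than a dense multiply (roughly 2n^1.5 operations instead of n^2 for n = m^2). As a rough illustration only, the sketch below assumes the factorization M = P·B1·P·B2, where P is the "transpose the m x m grid" perfect-shuffle permutation; the paper's exact ordering of factors and permutations may differ.

```python
import numpy as np

def monarch_matvec(blocks1, blocks2, x):
    """Multiply x by a Monarch-style matrix M = P @ B1 @ P @ B2.

    blocks1, blocks2: arrays of shape (m, m, m), i.e. m diagonal
    blocks of size m x m for the two block-diagonal factors.
    x: vector of length n = m * m.
    """
    m = blocks1.shape[0]
    # First block-diagonal factor B2: batched m x m matvecs.
    y = np.einsum('bij,bj->bi', blocks2, x.reshape(m, m))
    # Permutation P: transpose the (m, m) grid (a perfect shuffle).
    y = y.T
    # Second block-diagonal factor B1.
    y = np.einsum('bij,bj->bi', blocks1, y)
    # Apply P again (it is its own inverse) and flatten.
    return y.T.reshape(-1)

def block_diag(blocks):
    """Materialize a block-diagonal matrix from (m, k, k) blocks."""
    m, k = blocks.shape[0], blocks.shape[1]
    D = np.zeros((m * k, m * k))
    for b in range(m):
        D[b * k:(b + 1) * k, b * k:(b + 1) * k] = blocks[b]
    return D
```

Materializing the two factors densely and comparing against an explicit n x n matrix confirms the factored multiply, while the structured form touches only 2m^3 = 2n^1.5 parameters.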
Related papers
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models.
We present Weight Low-Rank Projection (WeLore), which unifies weight compression and memory-efficient fine-tuning in a single framework.
arXiv Detail & Related papers (2024-07-15T21:05:20Z) - Optimization-based Structural Pruning for Large Language Models without Back-Propagation [57.9629676017527]
We propose an optimization-based structural pruning method for large language models (LLMs).
Our method learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs for 2.7 hours with around 35GB of memory for 13B models on a single A100 GPU, and our pruned models outperform the state of the art in perplexity.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Compute Better Spent: Replacing Dense Layers with Structured Matrices [77.61728033234233]
We identify more efficient alternatives to dense matrices, as exemplified by the success of convolutional networks in the image domain.
We show that different structures often require drastically different initialization scales and learning rates, which are crucial to performance.
We propose a novel matrix family containing Monarch matrices, the Block-Train, which we show performs better than dense for the same compute on multiple tasks.
arXiv Detail & Related papers (2024-06-10T13:25:43Z) - Towards Higher Ranks via Adversarial Weight Pruning [34.602137305496335]
We propose a Rank-based PruninG (RPG) method to maintain the ranks of sparse weights in an adversarial manner.
RPG outperforms the state of the art by 1.13% top-1 accuracy on ImageNet with ResNet-50 at 98% sparsity.
arXiv Detail & Related papers (2023-11-29T10:04:39Z) - RSC: Accelerating Graph Neural Networks Training via Randomized Sparse
Computations [56.59168541623729]
Training graph neural networks (GNNs) is time consuming because sparse graph-based operations are hard to accelerate in hardware.
We explore trading off the computational precision to reduce the time complexity via sampling-based approximation.
We propose Randomized Sparse Computation, which demonstrates for the first time the potential of training GNNs with approximated operations.
arXiv Detail & Related papers (2022-10-19T17:25:33Z) - Learning to Optimize Quasi-Newton Methods [22.504971951262004]
This paper introduces a novel machine learning optimizer called LODO, which meta-learns the best preconditioner online during optimization.
Unlike other L2O methods, LODO does not require any meta-training on a training task distribution.
We show that our gradient approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians.
arXiv Detail & Related papers (2022-10-11T03:47:14Z) - Pixelated Butterfly: Simple and Efficient Sparse training for Neural
Network Models [24.92486575100738]
We show that Pixelated Butterfly is 3x faster than butterfly and speeds up training to achieve favorable accuracy-efficiency tradeoffs.
On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.5x faster than the dense Mixer, Vision Transformer, and GPT-2 medium models.
arXiv Detail & Related papers (2021-11-30T19:00:03Z) - Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks while using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z) - Exact Backpropagation in Binary Weighted Networks with Group Weight
Transformations [0.0]
Quantization-based model compression serves as a high-performing and fast approach to inference.
Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product.
arXiv Detail & Related papers (2021-07-03T10:29:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.