An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers
- URL: http://arxiv.org/abs/2406.11307v1
- Date: Mon, 17 Jun 2024 08:14:23 GMT
- Title: An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers
- Authors: Ashim Gupta, Sina Mahdipour Saravani, P. Sadayappan, Vivek Srikumar
- Abstract summary: We present a comprehensive analysis of factorization based model compression techniques.
We focus on comparing straightforward low-rank factorization against the recently introduced Monarch factorization.
Our experiments lead to the surprising conclusion that straightforward low-rank factorization consistently outperforms Monarch factorization.
- Score: 32.33602229853615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing size of transformer-based models in NLP makes the question of compressing them important. In this work, we present a comprehensive analysis of factorization based model compression techniques. Specifically, we focus on comparing straightforward low-rank factorization against the recently introduced Monarch factorization, which exhibits impressive performance preservation on the GLUE benchmark. To mitigate stability issues associated with low-rank factorization of the matrices in pre-trained transformers, we introduce a staged factorization approach wherein layers are factorized one by one instead of being factorized simultaneously. Through this strategy we significantly enhance the stability and reliability of the compression process. Further, we introduce a simple block-wise low-rank factorization method, which has a close relationship to Monarch factorization. Our experiments lead to the surprising conclusion that straightforward low-rank factorization consistently outperforms Monarch factorization across both different compression ratios and six different text classification tasks.
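As a concrete reference for the "straightforward low-rank factorization" baseline described above, the sketch below replaces a pre-trained linear layer with its rank-r truncated SVD. This is a generic PyTorch illustration, not the authors' implementation; the function name, the square-root split of the singular values, and the rank choice are our own assumptions.

```python
import torch
import torch.nn as nn

def low_rank_factorize(linear: nn.Linear, rank: int) -> nn.Sequential:
    """Replace a (d_out x d_in) Linear with two smaller Linears via truncated SVD."""
    W = linear.weight.data                   # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_S = torch.sqrt(S[:rank])            # split top-r singular values across factors
    A = Vh[:rank, :] * sqrt_S[:, None]       # (rank, d_in)  down-projection
    B = U[:, :rank] * sqrt_S[None, :]        # (d_out, rank) up-projection

    down = nn.Linear(W.shape[1], rank, bias=False)
    up = nn.Linear(rank, W.shape[0], bias=linear.bias is not None)
    down.weight.data.copy_(A)
    up.weight.data.copy_(B)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)           # x -> B(Ax) + bias ~= Wx + bias
```

At rank r the layer's parameter count drops from d_out*d_in to r*(d_in + d_out), so r directly sets the compression ratio. The staged approach from the abstract would apply such a replacement to layers one at a time rather than factorizing all of them simultaneously, and the block-wise variant applies the same idea independently to sub-blocks of W.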
Related papers
- Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis [63.66763657191476]
We show that efficient numerical training and inference algorithms, such as low-rank computation, achieve impressive performance when learning Transformer-based adaptation.
We analyze how magnitude-based pruning affects generalization while improving adaptation.
We conclude that proper magnitude-based pruning has only a slight effect on testing performance.
arXiv Detail & Related papers (2024-06-24T23:00:58Z)
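The summary above refers to magnitude-based pruning. For readers unfamiliar with the term, here is a minimal numpy illustration of global magnitude pruning of a weight matrix; it is our own sketch and is unrelated to the paper's analysis details.

```python
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    threshold = np.quantile(np.abs(W), sparsity)
    return np.where(np.abs(W) >= threshold, W, 0.0)

W = np.random.default_rng(0).standard_normal((8, 8))
W_pruned = magnitude_prune(W, sparsity=0.5)   # ~50% of entries set to zero
```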
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse Actions, Interventions and Sparse Temporal Dependencies [58.179981892921056]
This work introduces a novel principle for disentanglement we call mechanism sparsity regularization.
We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors.
We show that the latent factors can be recovered by regularizing the learned causal graph to be sparse.
arXiv Detail & Related papers (2024-01-10T02:38:21Z)
- DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization [12.277820111814691]
DSFormer is a simple alternative factorization scheme that expresses a target weight matrix as the product of a small dense matrix and a semi-structured sparse matrix.
Our approach is also orthogonal to mainstream compressors and offers up to 50% additional compression when added to popular distilled, layer-shared and quantized transformers.
arXiv Detail & Related papers (2023-12-20T17:27:25Z)
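To make DSFormer's dense-sparse shape arithmetic concrete, here is a crude numpy stand-in (ours, not the paper's fitting procedure): the target weight is approximated as a small dense factor times a sparsified factor, with unstructured magnitude sparsification standing in for the semi-structured pattern the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 768, 768, 192

W = rng.standard_normal((d_out, d_in))   # stand-in for a pre-trained weight

# SVD initialization: W ~= D @ S with a small dense D and a wide S.
U, sv, Vh = np.linalg.svd(W, full_matrices=False)
D = U[:, :k] * sv[:k]                    # (d_out, k) small dense factor
S = Vh[:k, :]                            # (k, d_in) factor to be sparsified

# Unstructured 50% magnitude sparsification as a stand-in for the
# hardware-friendly semi-structured pattern.
mask = np.abs(S) >= np.quantile(np.abs(S), 0.5)
W_approx = D @ (S * mask)                # same (d_out, d_in) shape as W

# Storage cost: d_out*k dense values plus the nonzeros of S,
# versus d_out*d_in for the original matrix.
```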
- Optimal vintage factor analysis with deflation varimax [18.50195604586597]
Vintage factor analysis aims to first find a low-dimensional representation of the original data, and then to seek a rotation such that the rotated low-dimensional representation is scientifically meaningful.
Perhaps the most widely used vintage factor analysis is Principal Component Analysis (PCA) followed by varimax rotation.
In this paper, we propose a deflation varimax procedure that solves each row of the rotation matrix sequentially.
arXiv Detail & Related papers (2023-10-16T16:14:43Z)
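For orientation, "PCA followed by varimax" in standard vintage factor analysis can be sketched in a few lines of numpy. The function below is the classic simultaneous varimax rotation (the textbook SVD-based iteration), not the paper's row-by-row deflation procedure.

```python
import numpy as np

def varimax(Phi: np.ndarray, gamma: float = 1.0, max_iter: int = 50,
            tol: float = 1e-6) -> np.ndarray:
    """Rotate a (p x k) loading matrix to maximize the varimax criterion."""
    p, k = Phi.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lam = Phi @ R
        u, s, vt = np.linalg.svd(
            Phi.T @ (Lam**3 - (gamma / p) * Lam @ np.diag((Lam**2).sum(axis=0)))
        )
        R = u @ vt
        d_old, d = d, s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return Phi @ R

# Vintage factor analysis: PCA loadings, then varimax rotation.
X = np.random.default_rng(0).standard_normal((200, 10))
X -= X.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt[:3].T * s[:3] / np.sqrt(len(X))   # (p x k) PCA loading matrix
rotated = varimax(loadings)
```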
- Low-Rank Prune-And-Factorize for Language Model Compression [18.088550230146247]
Matrix factorization fails to retain satisfactory performance under moderate to high compression rates.
We propose two techniques: sparsity-aware SVD and mixed-rank fine-tuning.
arXiv Detail & Related papers (2023-06-25T07:38:43Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- Identifiability in Exact Multilayer Sparse Matrix Factorization [0.0]
We prove that any matrix which is the product of L factors whose supports are exactly the so-called butterfly supports admits a unique sparse factorization into L factors.
This applies in particular to the Hadamard matrix or the discrete Fourier transform matrix of size 2^L.
arXiv Detail & Related papers (2021-10-04T07:50:51Z)
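The butterfly-support claim is easy to check numerically in the Hadamard case: the 2^L x 2^L Hadamard matrix H_2^{(kron) L} factors into L butterfly matrices, each of the form I (kron) H_2 (kron) I. The numpy sketch below (ours) verifies this standard identity.

```python
import numpy as np

L = 3
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

# Butterfly factor i is I_{2^i} (kron) H2 (kron) I_{2^(L-1-i)};
# each has exactly two nonzeros per row and column.
factors = [
    np.kron(np.kron(np.eye(2**i), H2), np.eye(2**(L - 1 - i)))
    for i in range(L)
]
product = np.linalg.multi_dot(factors)

# Direct Kronecker construction of the 2^L x 2^L Hadamard matrix.
H = H2
for _ in range(L - 1):
    H = np.kron(H, H2)

assert np.allclose(product, H)
```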
- Understanding Implicit Regularization in Over-Parameterized Single Index Model [55.41685740015095]
We design regularization-free algorithms for the high-dimensional single index model.
We provide theoretical guarantees for the induced implicit regularization phenomenon.
arXiv Detail & Related papers (2020-07-16T13:27:47Z)
- Provable Online CP/PARAFAC Decomposition of a Structured Tensor via Dictionary Learning [18.464203215845373]
We consider the problem of factorizing a structured 3-way tensor into its constituent Canonical Polyadic (CP) factors.
We develop a provable algorithm for structured tensor factorization.
arXiv Detail & Related papers (2020-06-30T00:31:06Z)
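For background, a CP/PARAFAC decomposition expresses a 3-way tensor X as a sum of rank-one terms, X[i,j,k] ~= sum_r A[i,r] B[j,r] C[k,r]. The paper develops a provable online algorithm via dictionary learning; the sketch below is only the textbook offline alternating-least-squares baseline, included for orientation.

```python
import numpy as np

def khatri_rao(P: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Column-wise Kronecker product of P (m x r) and Q (n x r) -> (m*n x r)."""
    r = P.shape[1]
    return np.einsum('mr,nr->mnr', P, Q).reshape(-1, r)

def cp_als(X: np.ndarray, rank: int, n_iter: int = 50, seed: int = 0):
    """Textbook CP decomposition of a 3-way tensor via alternating least squares."""
    I, J, K = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    X1 = X.reshape(I, -1)                      # mode-1 unfolding (I x JK)
    X2 = np.moveaxis(X, 1, 0).reshape(J, -1)   # mode-2 unfolding (J x IK)
    X3 = np.moveaxis(X, 2, 0).reshape(K, -1)   # mode-3 unfolding (K x IJ)
    for _ in range(n_iter):
        A = X1 @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = X2 @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = X3 @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Sanity check on a random exactly-rank-3 tensor.
rng = np.random.default_rng(1)
A0, B0, C0 = (rng.standard_normal((n, 3)) for n in (6, 7, 8))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=3)
X_hat = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # typically ~1e-8 or smaller
```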