Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank
Approximation
- URL: http://arxiv.org/abs/2009.05647v2
- Date: Sat, 26 Sep 2020 12:24:06 GMT
- Title: Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank
Approximation
- Authors: Murad Tukan and Alaa Maalouf and Matan Weksler and Dan Feldman
- Abstract summary: A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer).
Here, $d$ is the number of the neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$.
- Score: 23.06440095688755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common technique for compressing a neural network is to compute the
$k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix
$A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or
embedding layer). Here, $d$ is the number of the neurons in the layer, $n$ is
the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory
instead of $O(nd)$.
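To make the storage argument concrete, here is a minimal NumPy sketch (not the authors' code; the matrix shape and rank $k$ are arbitrary stand-ins) of computing $A_{k,2}$ via truncated SVD and storing it as two factors of total size $(n+d)k$:

```python
import numpy as np

def svd_rank_k(A: np.ndarray, k: int):
    """Return L (n x k) and R (k x d) with L @ R = A_{k,2},
    the best rank-k approximation of A under l2 (Frobenius) error."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = U[:, :k] * s[:k]   # n x k, columns scaled by the top-k singular values
    R = Vt[:k, :]          # k x d, top-k right singular vectors
    return L, R

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 768))   # stand-in for a layer's weight matrix
L, R = svd_rank_k(A, k=64)
print("stored entries:", L.size + R.size, "instead of", A.size)
print("relative l2 error:", np.linalg.norm(A - L @ R) / np.linalg.norm(A))
```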
This $\ell_2$-approximation minimizes the sum over every entry to the power
of $p=2$ in the matrix $A - A_{k,2}$, among every matrix
$A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed
efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive
to outliers ("far-away" rows). Hence, machine learning often uses, e.g., Lasso
regression, $\ell_1$-regularization, and $\ell_1$-SVM, which rely on the
$\ell_1$-norm.
This paper suggests replacing the $k$-rank $\ell_2$ approximation with
$\ell_p$, for $p\in [1,2]$. We then provide practical and provable
approximation algorithms to compute it for any $p\geq1$, based on modern
techniques in computational geometry.
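As a rough illustration only, the sketch below shows one generic way to target an $\ell_p$ objective: an iteratively reweighted least-squares (IRLS) heuristic that down-weights outlier rows relative to a plain SVD fit. This is not the paper's provable, computational-geometry-based algorithm; all names and constants here are our own.

```python
import numpy as np

def lp_rank_k_irls(A: np.ndarray, k: int, p: float = 1.0,
                   iters: int = 25, eps: float = 1e-8) -> np.ndarray:
    """Heuristic rank-k l_p approximation: repeatedly solve a row-weighted
    l2 problem (via SVD) and re-derive the weights from the row residuals."""
    n, d = A.shape
    w = np.ones(n)                            # square roots of the IRLS row weights
    A_k = np.zeros_like(A)
    for _ in range(iters):
        _, _, Vt = np.linalg.svd(A * w[:, None], full_matrices=False)
        V = Vt[:k]                            # k x d basis of the fitted row space
        A_k = (A @ V.T) @ V                   # project the *original* rows onto it
        r = np.linalg.norm(A - A_k, axis=1)   # per-row l2 residuals
        w = (r + eps) ** ((p - 2) / 2)        # p=2 keeps w=1 (plain SVD); p<2 down-weights outliers
    return A_k
```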
Extensive experimental results on the GLUE benchmark for compressing BERT,
DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example,
our approach achieves $28\%$ compression of RoBERTa's embedding layer with only
$0.63\%$ additive drop in accuracy (without fine-tuning) on average over
all tasks in GLUE, compared to an $11\%$ drop using the existing
$\ell_2$-approximation. Open code is provided for reproducing and extending our
results.
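For context on how such a factorization is typically deployed, here is a hedged PyTorch sketch (our own illustration, not the released code) that swaps a vocabulary-by-$d$ embedding matrix for a rank-$k$ embedding followed by a linear up-projection; the SVD factors below could be replaced by the paper's $\ell_p$ factors in the same way.

```python
import torch
import torch.nn as nn

def factorized_embedding(weight: torch.Tensor, k: int) -> nn.Sequential:
    """Replace a (vocab x d) embedding matrix by a rank-k pair:
    a (vocab x k) embedding followed by a (k -> d) linear layer."""
    vocab, d = weight.shape
    U, s, Vt = torch.linalg.svd(weight, full_matrices=False)
    L = U[:, :k] * s[:k]                # vocab x k
    R = Vt[:k, :]                       # k x d
    emb = nn.Embedding(vocab, k)
    emb.weight.data.copy_(L)
    proj = nn.Linear(k, d, bias=False)  # nn.Linear stores weight as (out, in) = (d, k)
    proj.weight.data.copy_(R.T)
    return nn.Sequential(emb, proj)     # token ids -> approximate original embeddings
```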
Related papers
- LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions [54.54897832889028]
We show that for any $K$, there is a "universal set" $U \subset [n]$ of size independent of $n$, such that for any $Q$ and any row $i$, the large attention scores $A_{i,j}$ in row $i$ of $A$ all have $j\in U$.
We empirically show the benefits of our scheme for vision transformers, showing how to train new models that use our universal set while training as well.
arXiv Detail & Related papers (2024-10-07T19:47:13Z) - Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p} n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Optimal Embedding Dimension for Sparse Subspace Embeddings [4.042707434058959]
A random $m\times n$ matrix $S$ is an oblivious subspace embedding (OSE)
We show that an $m\times n$ random matrix $S$ with $m\geq (1+\theta)d$ is an oblivious subspace embedding with $\epsilon = O_\theta(1)$.
We use this to construct the first oblivious subspace embedding with $O(d)$ embedding dimension that can be applied faster than current matrix multiplication time.
arXiv Detail & Related papers (2023-11-17T18:01:58Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^{d}$, given access to $A\in\mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z) - Deep Learning Meets Projective Clustering [66.726500395069]
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$.
Inspired by projective clustering from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces.
arXiv Detail & Related papers (2020-10-08T22:47:48Z)