Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank
Approximation
- URL: http://arxiv.org/abs/2009.05647v2
- Date: Sat, 26 Sep 2020 12:24:06 GMT
- Title: Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank
Approximation
- Authors: Murad Tukan and Alaa Maalouf and Matan Weksler and Dan Feldman
- Abstract summary: A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer).
Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$.
- Score: 23.06440095688755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common technique for compressing a neural network is to compute the
$k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix
$A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or
embedding layer). Here, $d$ is the number of the neurons in the layer, $n$ is
the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory
instead of $O(nd)$.
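The storage saving is easy to check concretely. A minimal NumPy sketch of the rank-$k$ $\ell_2$ factorization via SVD (the layer sizes $n$, $d$ and rank $k$ below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Rank-k l2 (Frobenius) approximation of a layer's weight matrix via SVD.
# Sizes are illustrative, not taken from any real network.
n, d, k = 400, 300, 20
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
L = U[:, :k] * s[:k]          # n x k factor (singular values folded in)
R = Vt[:k, :]                 # k x d factor
A_k2 = L @ R                  # best rank-k approximation under l2

# Storage: (n + d) * k numbers instead of n * d.
print((n + d) * k, n * d)     # 14000 vs 120000
```

Storing the two factors `L` and `R` instead of `A_k2` itself is what realizes the $O((n+d)k)$ memory bound.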
This $\ell_2$-approximation minimizes the sum over every entry to the power
of $p=2$ in the matrix $A - A_{k,2}$, among every matrix
$A_{k,2}\in\mathbb{R}^{n\times d}$ whose rank is $k$. While it can be computed
efficiently via SVD, the $\ell_2$-approximation is known to be very sensitive
to outliers ("far-away" rows). Hence, machine learning often turns to
$\ell_1$-norm-based methods such as Lasso Regression, $\ell_1$-regularization,
and $\ell_1$-SVM.
This paper suggests replacing the $k$-rank $\ell_2$ approximation with an
$\ell_p$ approximation, for $p\in [1,2]$. We then provide practical and provable
approximation algorithms to compute it for any $p\geq1$, based on modern
techniques in computational geometry.
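The paper's provable algorithms come from computational geometry; as a purely illustrative heuristic for the same robust objective (a hypothetical helper, not the authors' method), one can iteratively reweight rows by their residuals and re-solve a weighted $\ell_2$ problem with SVD:

```python
import numpy as np

def lp_low_rank_irls(A, k, p=1.0, iters=30, eps=1e-9):
    """Heuristic rank-k l_p approximation via iteratively reweighted SVD.

    Illustration only, not the paper's algorithm: rows with large
    residuals are down-weighted for p < 2, which damps the influence
    of outlier ("far-away") rows on the fitted subspace.
    """
    w = np.ones(A.shape[0])
    for _ in range(iters):
        sw = np.sqrt(w)[:, None]
        U, s, Vt = np.linalg.svd(sw * A, full_matrices=False)
        B = (U[:, :k] * s[:k]) @ Vt[:k, :] / sw    # undo the row weights
        r = np.linalg.norm(A - B, axis=1)          # per-row residuals
        w = np.maximum(r, eps) ** (p - 2)
    return B
```

For $p=2$ the weights stay at one and the routine reduces to plain truncated SVD; for $p=1$ it biases the fit toward an outlier-robust subspace, matching the motivation above.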
Extensive experimental results on the GLUE benchmark for compressing BERT,
DistilBERT, XLNet, and RoBERTa confirm this theoretical advantage. For example,
our approach achieves $28\%$ compression of RoBERTa's embedding layer with only
$0.63\%$ additive drop in accuracy (without fine-tuning) on average over
all tasks in GLUE, compared to an $11\%$ drop using the existing
$\ell_2$-approximation. Open code is provided for reproducing and extending our
results.
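As a back-of-the-envelope check of the compression arithmetic (reading "$28\%$ compression" as a $28\%$ memory saving, and using roughly RoBERTa-base's embedding shape as a hypothetical example), the memory ratio of a rank-$k$ factorization is $(n+d)k/(nd)$:

```python
# Memory ratio of a rank-k factorization vs. the dense matrix.
# Sizes are illustrative (approximately RoBERTa-base's embedding shape).
n, d = 50265, 768

def ratio(k):
    return (n + d) * k / (n * d)

# Largest rank giving at least ~28% savings (ratio <= 0.72), for illustration:
k = int(0.72 * n * d / (n + d))
print(k, round(1 - ratio(k), 3))   # 544 0.281
```

The point of the example is only that, for an embedding layer with $n \gg d$, even a rank of several hundred already yields a substantial memory saving.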
Related papers
- Coresets for Multiple $\ell_p$ Regression [47.790126028106734]
We construct coresets of size $\tilde{O}(\varepsilon^{-2}d)$ for $p<2$ and $\tilde{O}(\varepsilon^{-p}d^{p/2})$ for $p>2$.
For $1\leq p<2$, every matrix has a subset of $\tilde{O}(\varepsilon^{-1}k)$ rows which spans a $(1+\varepsilon)$-approximately optimal $k$-dimensional subspace for $\ell_p$ subspace approximation.
arXiv Detail & Related papers (2024-06-04T15:50:42Z)
- Optimal Embedding Dimension for Sparse Subspace Embeddings [4.042707434058959]
A random $m\times n$ matrix $S$ is an oblivious subspace embedding (OSE).
We show that an $m\times n$ random matrix $S$ with $m\geq (1+\theta)d$ is an oblivious subspace embedding with $\epsilon = O_\theta(1)$.
We use this to construct the first oblivious subspace embedding with $O(d)$ embedding dimension that can be applied faster than current matrix multiplication time.
arXiv Detail & Related papers (2023-11-17T18:01:58Z) - $\ell_p$-Regression in the Arbitrary Partition Model of Communication [59.89387020011663]
We consider the randomized communication complexity of the distributed $\ell_p$-regression problem in the coordinator model.
For $p = 2$, i.e., least squares regression, we give the first optimal bound of $\tilde{\Theta}(sd^2 + sd/\epsilon)$ bits.
For $p \in (1,2)$, we obtain an $\tilde{O}(sd^2/\epsilon + sd/\mathrm{poly}(\epsilon))$ upper bound.
arXiv Detail & Related papers (2023-07-11T08:51:53Z) - Low-Rank Approximation with $1/\epsilon^{1/3}$ Matrix-Vector Products [58.05771390012827]
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm.
Our main result is an algorithm that uses only $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products.
arXiv Detail & Related papers (2022-02-10T16:10:41Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree-$p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z)
- Deep Learning Meets Projective Clustering [66.726500395069]
A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$.
Inspired by projective clustering from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces.
arXiv Detail & Related papers (2020-10-08T22:47:48Z)
- On Efficient Low Distortion Ultrametric Embedding [18.227854382422112]
A widely-used method to preserve the underlying hierarchical structure of the data is to find an embedding of the data into a tree or an ultrametric.
In this paper, we provide a new algorithm which takes as input a set of points in $\mathbb{R}^d$ and runs in time $n^{1+\rho/c^2}$ (for a universal constant $\rho>1$) to output an ultrametric $\Delta$.
We show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.
arXiv Detail & Related papers (2020-08-15T11:06:45Z)
- Average Case Column Subset Selection for Entrywise $\ell_1$-Norm Loss [76.02734481158458]
It is known that in the worst case, to obtain a good rank-$k$ approximation to a matrix, one needs an arbitrarily large $n^{\Omega(1)}$ number of columns.
We show that under certain minimal and realistic distributional settings, it is possible to obtain a $(k/\epsilon)$-approximation with a nearly linear running time and $\mathrm{poly}(k/\epsilon)+O(k\log n)$ columns.
This is the first algorithm of any kind for achieving a $(k/\epsilon)$-approximation for entrywise $\ell_1$-norm loss.
arXiv Detail & Related papers (2020-04-16T22:57:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.