Deep Learning Meets Projective Clustering
- URL: http://arxiv.org/abs/2010.04290v1
- Date: Thu, 8 Oct 2020 22:47:48 GMT
- Title: Deep Learning Meets Projective Clustering
- Authors: Alaa Maalouf and Harry Lang and Daniela Rus and Dan Feldman
- Abstract summary: A common approach for compressing NLP networks is to encode the embedding layer as a matrix $A\in\mathbb{R}^{n\times d}$.
Inspired by \emph{projective clustering} from computational geometry, we suggest replacing this subspace by a set of $k$ subspaces.
- Score: 66.726500395069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common approach for compressing NLP networks is to encode the embedding
layer as a matrix $A\in\mathbb{R}^{n\times d}$, compute its rank-$j$
approximation $A_j$ via SVD, and then factor $A_j$ into a pair of matrices that
correspond to smaller fully-connected layers to replace the original embedding
layer. Geometrically, the rows of $A$ represent points in $\mathbb{R}^d$, and
the rows of $A_j$ represent their projections onto the $j$-dimensional subspace
that minimizes the sum of squared distances ("errors") to the points. In
practice, these rows of $A$ may be spread around $k>1$ subspaces, so factoring
$A$ based on a single subspace may lead to large errors that turn into large
drops in accuracy.
Inspired by \emph{projective clustering} from computational geometry, we
suggest replacing this subspace by a set of $k$ subspaces, each of dimension
$j$, that minimizes the sum of squared distances over every point (row in $A$)
to its \emph{closest} subspace. Based on this approach, we provide a novel
architecture that replaces the original embedding layer by a set of $k$ small
layers that operate in parallel and are then recombined with a single
fully-connected layer.
Extensive experimental results on the GLUE benchmark yield networks that are
both more accurate and smaller compared to the standard matrix factorization
(SVD). For example, we further compress DistilBERT by reducing the size of the
embedding layer by $40\%$ while incurring only a $0.5\%$ average drop in
accuracy over all nine GLUE tasks, compared to a $2.8\%$ drop using the
existing SVD approach. On RoBERTa we achieve $43\%$ compression of the
embedding layer with less than a $0.8\%$ average drop in accuracy as compared
to a $3\%$ drop previously. Open code for reproducing and extending our results
is provided.
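To make the contrast described in the abstract concrete, the sketch below compares the standard rank-$j$ SVD factorization of an embedding matrix with a $(k,j)$-projective-clustering factorization that assigns each row to its closest $j$-dimensional subspace. This is a minimal sketch assuming PyTorch; the function names, the Lloyd-style alternating heuristic, and the random data are illustrative assumptions, not the authors' released code (which the abstract says is available).

```python
# A minimal sketch, assuming PyTorch; all names are illustrative and are
# NOT the authors' released implementation.
import torch


def svd_factorize(A: torch.Tensor, j: int):
    """Baseline from the abstract: rank-j SVD of the n x d embedding matrix A,
    factored into U (n x j) and V (j x d) so that U @ V ~= A_j."""
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    return U[:, :j] * S[:j], Vh[:j, :]  # stores (n + d) * j numbers instead of n * d


def projective_cluster_factorize(A: torch.Tensor, k: int, j: int, iters: int = 20):
    """(k, j)-projective clustering sketch: assign every row of A to whichever of
    k j-dimensional subspaces is closest in squared distance, then refit each
    subspace by SVD on its assigned rows. This Lloyd-style alternation is only a
    plausible heuristic for the objective in the abstract, not the paper's solver."""
    n, d = A.shape
    assign = torch.randint(0, k, (n,))          # random initial partition of the rows
    for _ in range(iters):
        bases = []
        for c in range(k):
            rows = A[assign == c]
            if rows.shape[0] < j:               # degenerate cluster: reseed from random rows
                rows = A[torch.randperm(n)[:j]]
            _, _, Vh = torch.linalg.svd(rows, full_matrices=False)
            bases.append(Vh[:j, :])             # j x d orthonormal basis of the subspace
        B = torch.stack(bases)                  # k x j x d
        coords = torch.einsum('nd,kjd->nkj', A, B)                      # row coordinates in each basis
        err = (A * A).sum(dim=1, keepdim=True) - (coords * coords).sum(dim=-1)
        assign = err.argmin(dim=1)              # reassign each row to its closest subspace
    return B, assign


if __name__ == "__main__":
    torch.manual_seed(0)
    A = torch.randn(1000, 64)                   # stand-in for an embedding matrix
    U, V = svd_factorize(A, j=8)
    print("rank-8 SVD error:", torch.norm(A - U @ V).item())
    B, assign = projective_cluster_factorize(A, k=4, j=8)
    rec = torch.empty_like(A)
    for c in range(4):
        rows = A[assign == c]
        rec[assign == c] = (rows @ B[c].T) @ B[c]
    print("(4, 8)-projective clustering error:", torch.norm(A - rec).item())
```

In the paper's architecture, the per-cluster factors become $k$ small layers applied in parallel whose outputs are recombined by a single fully-connected layer; the sketch above only illustrates the clustering objective that motivates that design.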
Related papers
- Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
We study the problem of residual error estimation for matrix and vector norms using a linear sketch.
We demonstrate that this gives a substantial advantage empirically, for roughly the same sketch size and accuracy as in previous work.
We also show an $\Omega(k^{2/p}n^{1-2/p})$ lower bound for the sparse recovery problem, which is tight up to a $\mathrm{poly}(\log n)$ factor.
arXiv Detail & Related papers (2024-08-16T02:33:07Z) - Multilayer Correlation Clustering [12.492037397168579]
We establish Multilayer Correlation Clustering, a novel generalization of Correlation Clustering (Bansal et al., FOCS '02) to the multilayer setting.
In this paper, we are given a series of Correlation Clustering inputs (called layers) over a common set $V$.
The goal is then to find a clustering of $V$ that minimizes the $\ell_p$-norm ($p\geq 1$) of the disagreements vector.
arXiv Detail & Related papers (2024-04-25T15:25:30Z) - Robust Zero Level-Set Extraction from Unsigned Distance Fields Based on Double Covering [28.268387694075415]
We propose a new method for extracting the zero level-set from unsigned distance fields (UDFs).
DoubleCoverUDF takes a learned UDF and a user-specified parameter $r$ as input.
We show that the computed iso-surface is the boundary of the $r$-offset volume of the target zero level-set $S$.
arXiv Detail & Related papers (2023-10-05T10:17:30Z) - Active Sampling for Linear Regression Beyond the $\ell_2$ Norm [70.49273459706546]
We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector.
We show that this dependence on $d$ is optimal, up to logarithmic factors.
We also provide the first total sensitivity upper bound of $O(d^{\max\{1,p/2\}}\log^2 n)$ for loss functions with at most degree $p$ growth.
arXiv Detail & Related papers (2021-11-09T00:20:01Z) - Coresets for Decision Trees of Signals [19.537354146654845]
We provide the first algorithm that outputs such a $(k,\varepsilon)$-coreset for \emph{every} such matrix $D$.
This is done by forging a link between decision trees from machine learning and partition trees from computational geometry.
arXiv Detail & Related papers (2021-10-07T05:49:55Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^d$, given access to $A\in\mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z) - Small Covers for Near-Zero Sets of Polynomials and Learning Latent Variable Models [56.98280399449707]
We show that there exists an $\epsilon$-cover for $S$ of cardinality $M = (k/\epsilon)^{O_d(k^{1/d})}$.
Building on our structural result, we obtain significantly improved learning algorithms for several fundamental high-dimensional probabilistic models with hidden variables.
arXiv Detail & Related papers (2020-12-14T18:14:08Z) - Farthest sampling segmentation of triangulated surfaces [0.0]
Farthest Sampling Segmentation (FSS) is a new method for segmentation of triangulated surfaces.
The FSS method does not depend on parameters that must be tuned by hand and it is very flexible.
Numerical experiments with several metrics and a large variety of 3D triangular meshes show that the segmentations obtained by computing less than 10% of the columns of $W$ are as good as those obtained by clustering the rows of the full matrix $W$.
arXiv Detail & Related papers (2020-12-01T13:31:44Z) - Compressed Deep Networks: Goodbye SVD, Hello Robust Low-Rank Approximation [23.06440095688755]
A common technique for compressing a neural network is to compute the $k$-rank $\ell_2$ approximation $A_{k,2}$ of the matrix $A\in\mathbb{R}^{n\times d}$ that corresponds to a fully connected layer (or embedding layer).
Here, $d$ is the number of neurons in the layer, $n$ is the number in the next one, and $A_{k,2}$ can be stored in $O((n+d)k)$ memory instead of $O(nd)$.
arXiv Detail & Related papers (2020-09-11T20:21:42Z) - Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of a matroid $M$ such that $\det(\sum_{i\in S} v_i v_i^\top)$ is maximized.
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.