Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile
Streaming
- URL: http://arxiv.org/abs/2310.19068v1
- Date: Sun, 29 Oct 2023 16:46:26 GMT
- Authors: Gregory Dexter, Petros Drineas, David P. Woodruff, Taisuke Yasuda
- Abstract summary: We develop new techniques to extend the applicability of sketching-based approaches to sparse dictionary learning and the Euclidean $k$-means clustering problems.
On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem.
On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sketching algorithms have recently proven to be a powerful approach both for
designing low-space streaming algorithms as well as fast polynomial time
approximation schemes (PTAS). In this work, we develop new techniques to extend
the applicability of sketching-based approaches to the sparse dictionary
learning and the Euclidean $k$-means clustering problems. In particular, we
initiate the study of the challenging setting where the dictionary/clustering
assignment for each of the $n$ input points must be output, which has
surprisingly received little attention in prior work. On the fast algorithms
front, we obtain a new approach for designing PTAS's for the $k$-means
clustering problem, which generalizes to the first PTAS for the sparse
dictionary learning problem. On the streaming algorithms front, we obtain new
upper bounds and lower bounds for dictionary learning and $k$-means clustering.
In particular, given a design matrix $\mathbf A\in\mathbb R^{n\times d}$ in a
turnstile stream, we show an $\tilde O(nr/\epsilon^2 + dk/\epsilon)$ space
upper bound for $r$-sparse dictionary learning of size $k$, an $\tilde
O(n/\epsilon^2 + dk/\epsilon)$ space upper bound for $k$-means clustering, as
well as an $\tilde O(n)$ space upper bound for $k$-means clustering on random
order row insertion streams with a natural "bounded sensitivity" assumption. On
the lower bounds side, we obtain a general $\tilde\Omega(n/\epsilon +
dk/\epsilon)$ lower bound for $k$-means clustering, as well as an
$\tilde\Omega(n/\epsilon^2)$ lower bound for algorithms which can estimate the
cost of a single fixed set of candidate centers.
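To make the turnstile setting concrete, here is a minimal illustrative sketch (a toy under simplifying assumptions, not the paper's algorithm): because a linear sketch $\mathbf{S}\mathbf{A}$ is linear in $\mathbf{A}$, arbitrary additive entry updates $(i, j, \Delta)$, including deletions via negative $\Delta$, fold directly into the sketch state without ever materializing $\mathbf{A}$.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, m = 100, 20, 300  # rows, columns, sketch size

# Random sign (Rademacher) sketching matrix S in R^{m x n}, scaled so that
# E[S^T S] = I. Any linear sketch behaves the same way under turnstile updates.
S = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)

A = np.zeros((n, d))       # ground-truth design matrix (kept only for checking)
sketch = np.zeros((m, d))  # the only state a streaming algorithm would keep

# Turnstile stream: additive updates (i, j, delta). By linearity,
# S (A + delta * e_i e_j^T) = S A + delta * S[:, i] e_j^T,
# so each update touches a single column of the sketch.
for _ in range(5000):
    i, j = rng.integers(n), rng.integers(d)
    delta = rng.normal()
    A[i, j] += delta
    sketch[:, j] += delta * S[:, i]

# The maintained state equals S @ A exactly, and it approximately
# preserves the Frobenius norm of A.
assert np.allclose(sketch, S @ A)
rel_err = abs(np.linalg.norm(sketch, "fro") - np.linalg.norm(A, "fro")) \
    / np.linalg.norm(A, "fro")
print(f"relative Frobenius-norm error: {rel_err:.3f}")
```

The sketch dimension $m$ here is an arbitrary illustrative choice; the paper's space bounds come from problem-specific sketch constructions, not this generic sign sketch.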
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n \log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}\,d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random
Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Near-Optimal Quantum Coreset Construction Algorithms for Clustering [15.513270929560088]
We give quantum algorithms that find coresets for $k$-clustering in $\mathbb{R}^d$ with $\tilde{O}(\sqrt{nk}\,d^{3/2})$ query complexity.
Our coreset reduces the input size from $n$ to $\mathrm{poly}(k \epsilon^{-1} d)$, so that existing $\alpha$-approximation algorithms for clustering can run on top of it.
arXiv Detail & Related papers (2023-06-05T12:22:46Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e. our bound holds for $alpha$ up to $1/2$, an improvement over $alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z)