Randomized Dimensionality Reduction for Facility Location and
Single-Linkage Clustering
- URL: http://arxiv.org/abs/2107.01804v1
- Date: Mon, 5 Jul 2021 05:55:26 GMT
- Title: Randomized Dimensionality Reduction for Facility Location and
Single-Linkage Clustering
- Authors: Shyam Narayanan, Sandeep Silwal, Piotr Indyk, Or Zamir
- Abstract summary: Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems.
We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem.
- Score: 13.208510864854894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random dimensionality reduction is a versatile tool for speeding up
algorithms for high-dimensional problems. We study its application to two
clustering problems: the facility location problem, and the single-linkage
hierarchical clustering problem, which is equivalent to computing the minimum
spanning tree. We show that if we project the input pointset $X$ onto a random
$d = O(d_X)$-dimensional subspace (where $d_X$ is the doubling dimension of
$X$), then the optimum facility location cost in the projected space
approximates the original cost up to a constant factor. We show an analogous
statement for minimum spanning tree, but with the dimension $d$ having an extra
$\log \log n$ term and the approximation factor being arbitrarily close to $1$.
Furthermore, we extend these results to approximating solutions instead of just
their costs. Lastly, we provide experimental results to validate the quality of
solutions and the speedup due to the dimensionality reduction. Unlike several
previous papers studying this approach in the context of $k$-means and
$k$-medians, our dimension bound does not depend on the number of clusters but
only on the intrinsic dimensionality of $X$.
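As a rough end-to-end illustration of the setup (not the paper's analysis), the sketch below projects a point set with a scaled Gaussian matrix, a standard Johnson-Lindenstrauss style map, and compares the minimum spanning tree (single-linkage) cost before and after projection. The data, dimensions, and seed are assumptions for the demo; the paper chooses the target dimension as $O(d_X)$ in terms of the doubling dimension.
```python
# Minimal sketch: random Gaussian projection, then compare the Euclidean
# MST (single-linkage) cost in the original and projected spaces.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_cost(points):
    """Total edge weight of the Euclidean MST of `points`."""
    dists = squareform(pdist(points))          # dense pairwise distance matrix
    return minimum_spanning_tree(dists).sum()  # total MST weight

rng = np.random.default_rng(0)
n, D, d = 500, 1000, 20            # ambient dim D, target dim d (assumed values)
X = rng.normal(size=(n, D))

# Johnson-Lindenstrauss style projection: scaled Gaussian matrix.
G = rng.normal(size=(D, d)) / np.sqrt(d)
X_proj = X @ G

c_orig, c_proj = mst_cost(X), mst_cost(X_proj)
print("MST cost, original :", c_orig)
print("MST cost, projected:", c_proj)
print("ratio              :", c_proj / c_orig)
```
The same projection can be applied before running any facility location heuristic; the paper's contribution is proving that the costs in the projected space approximate the original ones up to the stated factors.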
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$, where $\delta(x)$ is the distance from $x$ to its $\lceil n/k \rceil$-th nearest neighbor (a toy fairness check follows this entry).
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
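A minimal sketch of the individual-fairness condition quoted in the entry above, assuming the standard choice from Jung et al. that $\delta(x)$ is the distance from $x$ to its $\lceil n/k \rceil$-th nearest neighbor; the data and the randomly chosen centers are placeholders.
```python
# Check individual fairness: every point must have a center within delta(x),
# where delta(x) is the distance to its ceil(n/k)-th nearest neighbor
# (counting x itself, per the definition in Jung et al.).
import numpy as np

def is_individually_fair(P, centers, k):
    n = len(P)
    m = int(np.ceil(n / k))
    for x in P:
        dists = np.sort(np.linalg.norm(P - x, axis=1))
        delta_x = dists[m - 1]              # m-th nearest point (x itself is the 1st)
        nearest_center = np.min(np.linalg.norm(centers - x, axis=1))
        if nearest_center > delta_x:
            return False
    return True

rng = np.random.default_rng(1)
P = rng.normal(size=(200, 2))
centers = P[rng.choice(len(P), size=5, replace=False)]  # arbitrary demo centers
print(is_individually_fair(P, centers, k=5))
```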
- Debiasing and a local analysis for population clustering using semidefinite programming [1.9761774213809036]
We consider the problem of partitioning a small data sample of size $n$ drawn from a mixture of $2$ sub-gaussian distributions.
This work is motivated by the application of clustering individuals according to their population of origin.
arXiv Detail & Related papers (2024-01-16T03:14:24Z)
- Simple, Scalable and Effective Clustering via One-Dimensional Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective (a simplified sketch follows this entry).
arXiv Detail & Related papers (2023-10-25T16:37:45Z)
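A deliberately simplified stand-in for the one-dimensional-projection idea in the entry above: project onto one random direction and cut the sorted projections at the $k-1$ largest gaps. The paper's actual algorithm and its $O(\mathrm{nnz}(X) + n\log n)$ implementation differ; this only conveys the flavor.
```python
# Toy 1-D projection clustering: project onto a random unit direction,
# then split the sorted projections at the k-1 largest gaps.
import numpy as np

def one_dim_cluster(X, k, rng):
    u = rng.normal(size=X.shape[1])
    u /= np.linalg.norm(u)                        # random unit direction
    proj = X @ u                                  # 1-D projections
    order = np.argsort(proj)
    gaps = np.diff(proj[order])                   # gaps between consecutive points
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])   # positions of the k-1 largest gaps
    labels = np.zeros(len(X), dtype=int)
    for c, idx in enumerate(np.split(order, cuts + 1)):
        labels[idx] = c
    return labels

rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(loc=c, size=(100, 10)) for c in (0.0, 5.0, 10.0)])
print(np.bincount(one_dim_cluster(X, k=3, rng=rng)))  # cluster sizes
```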
- Pseudonorm Approachability and Applications to Regret Minimization [73.54127663296906]
We convert high-dimensional $\ell_\infty$-approachability problems to low-dimensional pseudonorm approachability problems.
We develop an algorithmic theory of pseudonorm approachability, analogous to previous work on approachability for $\ell_2$ and other norms.
arXiv Detail & Related papers (2023-02-03T03:19:14Z)
- Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
- New Coresets for Projective Clustering and Applications [34.82221047030618]
Given a set of points $P$ in $\mathbb{R}^d$, the goal is to find $k$ flats of dimension $j$, i.e., affine subspaces, that best fit $P$ under a given distance measure.
We show that our construction provides efficient coreset constructions for Cauchy, Welsch, Huber, Geman-McClure, Tukey, $L_\infty$, and Fair regression (a sketch of the objective follows this entry).
arXiv Detail & Related papers (2022-03-08T19:50:27Z)
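A minimal sketch of the $(k, j)$-projective clustering objective from the entry above: the sum of squared distances from each point to its nearest $j$-dimensional affine flat. The candidate flats here are arbitrary placeholders, and the coreset construction itself is not shown.
```python
# Evaluate the k-flats (projective clustering) cost for given candidate flats.
import numpy as np

def dist_to_flat(P, point, basis):
    """Distance from each row of P to the affine flat {point + basis @ t};
    `basis` has orthonormal columns (d x j)."""
    diff = P - point
    residual = diff - (diff @ basis) @ basis.T   # remove the in-flat component
    return np.linalg.norm(residual, axis=1)

def k_flats_cost(P, flats):
    """Sum of squared distances to the nearest flat."""
    d = np.stack([dist_to_flat(P, p0, B) for (p0, B) in flats])
    return np.sum(np.min(d, axis=0) ** 2)

rng = np.random.default_rng(3)
P = rng.normal(size=(100, 5))
flats = []
for _ in range(3):                                # k = 3 flats of dimension j = 2
    B, _ = np.linalg.qr(rng.normal(size=(5, 2)))  # orthonormal basis of the flat
    flats.append((rng.normal(size=5), B))
print(k_flats_cost(P, flats))
```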
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper bounds can be extended to two-layer ReLU neural networks (a toy quantization sketch follows this entry).
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
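A generic illustration of the quantization setting from the entry above: fit a linear model by least squares, round its coefficients to a uniform $b$-bit grid, and measure the excess prediction error. The naive rounding scheme is an assumption for the demo, not the paper's minimax-optimal quantizer.
```python
# Naive uniform quantization of a learned linear model's coefficients.
import numpy as np

def quantize_uniform(w, bits):
    """Round each coordinate to a symmetric uniform grid with 2**(bits-1)-1
    levels per side, scaled to the largest coefficient."""
    scale = np.max(np.abs(w))
    levels = 2 ** (bits - 1) - 1
    return np.round(w / scale * levels) / levels * scale

rng = np.random.default_rng(4)
n, d = 500, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # learned linear model
for b in (2, 4, 8):
    w_q = quantize_uniform(w_hat, b)
    print(f"{b}-bit excess risk: {np.mean((X @ (w_q - w_hat)) ** 2):.5f}")
```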
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$ (a PCA-based sketch follows this entry).
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
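A hedged sketch of the PCA-based recipe from the entry above for a two-component mixture: center the data, project onto the leading principal direction, and split by sign. This conveys the general approach; the exact algorithm and its analysis are in the paper.
```python
# Two-cluster PCA sketch: cluster by the sign of the top principal score.
import numpy as np

def pca_two_cluster(X):
    Xc = X - X.mean(axis=0)                    # center the sample
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[0]                        # projection on top principal direction
    return (scores > 0).astype(int)            # split by sign

rng = np.random.default_rng(5)
theta = np.zeros(50)
theta[:5] = 3.0                                # sparse mean separation (assumed)
X = np.concatenate([rng.normal(loc=theta, size=(100, 50)),
                    rng.normal(loc=-theta, size=(100, 50))])
print("cluster sizes:", np.bincount(pca_two_cluster(X)))
```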
- Maximizing Determinants under Matroid Constraints [69.25768526213689]
We study the problem of finding a basis $S$ of a matroid $M$ such that $\det(\sum_{i \in S} v_i v_i^\top)$ is maximized (a toy greedy sketch follows this entry).
This problem appears in a diverse set of areas such as experimental design, fair allocation of goods, network design, and machine learning.
arXiv Detail & Related papers (2020-04-16T19:16:38Z)
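A toy sketch of the objective $\det(\sum_{i \in S} v_i v_i^\top)$ from the entry above, with a simple greedy heuristic under a cardinality constraint (the uniform matroid, a special case); general matroid constraints and approximation guarantees are beyond this illustration.
```python
# Greedy heuristic for determinant maximization under a cardinality constraint.
import numpy as np

def greedy_subset(V, k, eps=1e-6):
    """Repeatedly add the vector with the largest gain in
    log det(eps*I + sum_{i in S} v_i v_i^T); eps keeps early matrices invertible."""
    d = V.shape[1]
    S, M = [], eps * np.eye(d)
    for _ in range(k):
        gains = [np.linalg.slogdet(M + np.outer(v, v))[1] if i not in S else -np.inf
                 for i, v in enumerate(V)]
        best = int(np.argmax(gains))
        S.append(best)
        M = M + np.outer(V[best], V[best])
    return S

rng = np.random.default_rng(6)
V = rng.normal(size=(30, 5))                   # 30 candidate vectors in R^5
S = greedy_subset(V, k=5)                      # subset of basis size
M = sum(np.outer(V[i], V[i]) for i in S)
print("chosen subset:", S)
print("log det(sum v_i v_i^T):", np.linalg.slogdet(M)[1])
```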
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.