Faster DBSCAN via subsampled similarity queries
- URL: http://arxiv.org/abs/2006.06743v2
- Date: Thu, 22 Oct 2020 01:19:35 GMT
- Title: Faster DBSCAN via subsampled similarity queries
- Authors: Heinrich Jiang, Jennifer Jang, Jakub Łącki
- Abstract summary: DBSCAN is a popular density-based clustering algorithm.
We propose SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph.
- Score: 42.93847281082316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DBSCAN is a popular density-based clustering algorithm. It computes the
$\epsilon$-neighborhood graph of a dataset and uses the connected components of
the high-degree nodes to decide the clusters. However, the full neighborhood
graph may be too costly to compute with a worst-case complexity of $O(n^2)$. In
this paper, we propose a simple variant called SNG-DBSCAN, which clusters based
on a subsampled $\epsilon$-neighborhood graph, only requires access to
similarity queries for pairs of points and in particular avoids any complex
data structures which need the embeddings of the data points themselves. The
runtime of the procedure is $O(sn^2)$, where $s$ is the sampling rate. We show
under some natural theoretical assumptions that $s \approx \log n/n$ is
sufficient for statistical cluster recovery guarantees leading to an $O(n\log
n)$ complexity. We provide an extensive experimental analysis showing that on
large datasets, one can subsample as little as $0.1\%$ of the neighborhood
graph, leading to over 200x speedup and a 250x reduction in RAM
consumption compared to scikit-learn's implementation of DBSCAN, while still
maintaining competitive clustering performance.
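To make the construction concrete, below is a minimal sketch of clustering on a subsampled $\epsilon$-neighborhood graph, assuming Euclidean distance as the similarity query. The function name `sng_dbscan`, its parameters (`eps`, `min_pts`, `s`), and the SciPy-based graph construction are illustrative assumptions, not the paper's reference implementation. With $s \approx \log n/n$, the expected number of sampled pairs (and hence similarity queries) is $O(n\log n)$, which is where the stated complexity comes from.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, eps, min_pts, s, seed=0):
    """Illustrative sketch: cluster on a subsampled eps-neighborhood graph
    using only pairwise distance (similarity) queries."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Subsample pairs: each of the ~n(n-1)/2 pairs is kept with probability s,
    # so the expected number of similarity queries is s * n(n-1)/2.
    m = rng.binomial(n * (n - 1) // 2, s)
    i = rng.integers(0, n, size=m)
    j = rng.integers(0, n, size=m)
    keep = i != j
    i, j = i[keep], j[keep]

    # One similarity query per sampled pair; an edge iff the points are within eps.
    close = np.linalg.norm(X[i] - X[j], axis=1) <= eps
    i, j = i[close], j[close]

    # Symmetric sparse adjacency of the subsampled eps-neighborhood graph.
    adj = coo_matrix(
        (np.ones(2 * len(i)), (np.r_[i, j], np.r_[j, i])), shape=(n, n)
    ).tocsr()

    # "High-degree" (core) points; the threshold plays the role of DBSCAN's
    # minPts and in practice would be chosen relative to the sampling rate s.
    degree = np.asarray(adj.sum(axis=1)).ravel()
    core_idx = np.flatnonzero(degree >= min_pts)

    # Clusters are the connected components of the core points.
    labels = np.full(n, -1)
    _, comp = connected_components(adj[core_idx][:, core_idx], directed=False)
    labels[core_idx] = comp

    # Remaining points join the cluster of any sampled core neighbor, else noise.
    is_core = np.zeros(n, dtype=bool)
    is_core[core_idx] = True
    for p in np.flatnonzero(~is_core):
        nbrs = adj.indices[adj.indptr[p]:adj.indptr[p + 1]]
        core_nbrs = nbrs[is_core[nbrs]]
        if len(core_nbrs):
            labels[p] = labels[core_nbrs[0]]
    return labels
```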
Related papers
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - On the Unlikelihood of D-Separation [69.62839677485087]
We provide analytic evidence that on large graphs, d-separation is a rare phenomenon, even when guaranteed to exist.
For the PC Algorithm, while it is known that its worst-case guarantees fail on non-sparse graphs, we show that the same is true for the average case.
For UniformSGS, while it is known that the running time is exponential for existing edges, we show that in the average case the expected running time is exponential for most non-existing edges as well.
arXiv Detail & Related papers (2023-03-10T00:11:18Z) - Systematically improving existing k-means initialization algorithms at
nearly no cost, by pairwise-nearest-neighbor smoothing [1.2570180539670577]
We present a meta-method for initializing the $k$-means clustering algorithm called PNN-smoothing.
It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method (a rough sketch is given after this list).
arXiv Detail & Related papers (2022-02-08T15:56:30Z) - Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$ times the optimal cost.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O(\frac{n^3}{Q})$, where $Q$ is the number of similarity queries.
arXiv Detail & Related papers (2020-02-26T15:18:20Z)
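As referenced in the PNN-smoothing entry above, here is a rough sketch of that initialization scheme: split the data into $J$ random subsets, cluster each one individually, and merge the resulting centers pairwise-nearest-neighbor style down to $k$. The use of scikit-learn's `KMeans` for the per-subset clustering and the size-weighted merge cost are assumptions made for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def pnn_smoothing_init(X, k, J=5, seed=0):
    """Hedged sketch of a PNN-smoothing-style k-means initialization."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), J)

    # 1. Cluster each of the J random subsets on its own.
    centers, weights = [], []
    for part in parts:
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X[part])
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    centers = np.vstack(centers).astype(float)
    weights = np.concatenate(weights).astype(float)

    # 2. Pairwise-nearest-neighbor merge: repeatedly fuse the two centers with
    #    the smallest size-weighted merge cost until only k centers remain.
    while len(centers) > k:
        d2 = np.sum((centers[:, None] - centers[None]) ** 2, axis=-1)
        cost = (weights[:, None] * weights[None]
                / (weights[:, None] + weights[None])) * d2
        np.fill_diagonal(cost, np.inf)
        a, b = np.unravel_index(np.argmin(cost), cost.shape)
        w = weights[a] + weights[b]
        centers[a] = (weights[a] * centers[a] + weights[b] * centers[b]) / w
        weights[a] = w
        centers = np.delete(centers, b, axis=0)
        weights = np.delete(weights, b)

    return centers
```

The returned centers can then be passed as the initialization of a final run on the full dataset, e.g. `KMeans(n_clusters=k, init=pnn_smoothing_init(X, k), n_init=1).fit(X)`.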
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.