Faster DBSCAN via subsampled similarity queries
- URL: http://arxiv.org/abs/2006.06743v2
- Date: Thu, 22 Oct 2020 01:19:35 GMT
- Title: Faster DBSCAN via subsampled similarity queries
- Authors: Heinrich Jiang, Jennifer Jang, Jakub Łącki
- Abstract summary: DBSCAN is a popular density-based clustering algorithm.
We propose SNG-DBSCAN, which clusters based on a subsampled $\epsilon$-neighborhood graph.
- Score: 42.93847281082316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DBSCAN is a popular density-based clustering algorithm. It computes the
$\epsilon$-neighborhood graph of a dataset and uses the connected components of
the high-degree nodes to decide the clusters. However, the full neighborhood
graph may be too costly to compute with a worst-case complexity of $O(n^2)$. In
this paper, we propose a simple variant called SNG-DBSCAN, which clusters based
on a subsampled $\epsilon$-neighborhood graph, only requires access to
similarity queries for pairs of points and in particular avoids any complex
data structures which need the embeddings of the data points themselves. The
runtime of the procedure is $O(sn^2)$, where $s$ is the sampling rate. We show
under some natural theoretical assumptions that $s \approx \log n/n$ is
sufficient for statistical cluster recovery guarantees leading to an $O(n\log
n)$ complexity. We provide an extensive experimental analysis showing that on
large datasets, one can subsample as little as $0.1\%$ of the neighborhood
graph, leading to over 200x speedup and a 250x reduction in RAM
consumption compared to scikit-learn's implementation of DBSCAN, while still
maintaining competitive clustering performance.
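To make the construction concrete, below is a minimal sketch of clustering on a subsampled $\epsilon$-neighborhood graph, assuming Euclidean distance as the similarity query. The function name `sng_dbscan`, its parameters (`eps`, `min_pts`, `s`), and the SciPy-based graph construction are illustrative assumptions, not the paper's reference implementation. With $s \approx \log n/n$, the expected number of sampled pairs (and hence similarity queries) is $O(n\log n)$, which is where the stated complexity comes from.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components


def sng_dbscan(X, eps, min_pts, s, seed=0):
    """Illustrative sketch: cluster on a subsampled eps-neighborhood graph
    using only pairwise distance (similarity) queries."""
    rng = np.random.default_rng(seed)
    n = len(X)

    # Subsample pairs: each of the ~n(n-1)/2 pairs is kept with probability s,
    # so the expected number of similarity queries is s * n(n-1)/2.
    m = rng.binomial(n * (n - 1) // 2, s)
    i = rng.integers(0, n, size=m)
    j = rng.integers(0, n, size=m)
    keep = i != j
    i, j = i[keep], j[keep]

    # One similarity query per sampled pair; an edge iff the points are within eps.
    close = np.linalg.norm(X[i] - X[j], axis=1) <= eps
    i, j = i[close], j[close]

    # Symmetric sparse adjacency of the subsampled eps-neighborhood graph.
    adj = coo_matrix(
        (np.ones(2 * len(i)), (np.r_[i, j], np.r_[j, i])), shape=(n, n)
    ).tocsr()

    # "High-degree" (core) points; the threshold plays the role of DBSCAN's
    # minPts and in practice would be chosen relative to the sampling rate s.
    degree = np.asarray(adj.sum(axis=1)).ravel()
    core_idx = np.flatnonzero(degree >= min_pts)

    # Clusters are the connected components of the core points.
    labels = np.full(n, -1)
    _, comp = connected_components(adj[core_idx][:, core_idx], directed=False)
    labels[core_idx] = comp

    # Remaining points join the cluster of any sampled core neighbor, else noise.
    is_core = np.zeros(n, dtype=bool)
    is_core[core_idx] = True
    for p in np.flatnonzero(~is_core):
        nbrs = adj.indices[adj.indptr[p]:adj.indptr[p + 1]]
        core_nbrs = nbrs[is_core[nbrs]]
        if len(core_nbrs):
            labels[p] = labels[core_nbrs[0]]
    return labels
```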
Related papers
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - A Unified Framework for Gradient-based Clustering of Distributed Data [51.904327888475606]
We develop a family of distributed clustering algorithms that work over networks of users.
DGC-$\mathcal{F}_\rho$ is specialized to popular clustering losses like $K$-means and Huber loss.
We show that consensus fixed points of DGC-$\mathcal{F}_\rho$ are equivalent to fixed points of gradient clustering over the full data.
arXiv Detail & Related papers (2024-02-02T10:44:42Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - On the Unlikelihood of D-Separation [69.62839677485087]
We provide analytic evidence that on large graphs, d-separation is a rare phenomenon, even when guaranteed to exist.
For the PC Algorithm, while it is known that its worst-case guarantees fail on non-sparse graphs, we show that the same is true for the average case.
For UniformSGS, while it is known that the running time is exponential for existing edges, we show that in the average case the expected running time is exponential for most non-existing edges as well.
arXiv Detail & Related papers (2023-03-10T00:11:18Z) - Systematically improving existing k-means initialization algorithms at
nearly no cost, by pairwise-nearest-neighbor smoothing [1.2570180539670577]
We present a meta-method for initializing the $k$-means clustering algorithm called PNN-smoothing.
It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor method (a rough sketch is given after this list).
arXiv Detail & Related papers (2022-02-08T15:56:30Z) - Near-Optimal Explainable $k$-Means for All Dimensions [13.673697350508965]
We show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1-2/d}\,\mathrm{poly}(d\log k)$ times the optimal cost.
We get an improved bound of $k^{1-2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$.
arXiv Detail & Related papers (2021-06-29T16:59:03Z) - Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Explainable $k$-Means and $k$-Medians Clustering [25.513261099927163]
We consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner.
We show that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost.
We design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves.
arXiv Detail & Related papers (2020-02-28T04:21:53Z) - Query-Efficient Correlation Clustering [13.085439249887713]
Correlation clustering is arguably the most natural formulation of clustering.
A main drawback of correlation clustering is that it requires as input the $\Theta(n^2)$ pairwise similarities.
We devise a correlation clustering algorithm that attains a solution whose expected number of disagreements is at most $3\cdot \mathrm{OPT} + O(\frac{n^3}{Q})$, where $Q$ is the number of similarity queries.
arXiv Detail & Related papers (2020-02-26T15:18:20Z)
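As referenced in the PNN-smoothing entry above, here is a rough sketch of that initialization scheme: split the data into $J$ random subsets, cluster each one individually, and merge the resulting centers pairwise-nearest-neighbor style down to $k$. The use of scikit-learn's `KMeans` for the per-subset clustering and the size-weighted merge cost are assumptions made for illustration, not the authors' exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans


def pnn_smoothing_init(X, k, J=5, seed=0):
    """Hedged sketch of a PNN-smoothing-style k-means initialization."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), J)

    # 1. Cluster each of the J random subsets on its own.
    centers, weights = [], []
    for part in parts:
        km = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X[part])
        centers.append(km.cluster_centers_)
        weights.append(np.bincount(km.labels_, minlength=k))
    centers = np.vstack(centers).astype(float)
    weights = np.concatenate(weights).astype(float)

    # 2. Pairwise-nearest-neighbor merge: repeatedly fuse the two centers with
    #    the smallest size-weighted merge cost until only k centers remain.
    while len(centers) > k:
        d2 = np.sum((centers[:, None] - centers[None]) ** 2, axis=-1)
        cost = (weights[:, None] * weights[None]
                / (weights[:, None] + weights[None])) * d2
        np.fill_diagonal(cost, np.inf)
        a, b = np.unravel_index(np.argmin(cost), cost.shape)
        w = weights[a] + weights[b]
        centers[a] = (weights[a] * centers[a] + weights[b] * centers[b]) / w
        weights[a] = w
        centers = np.delete(centers, b, axis=0)
        weights = np.delete(weights, b)

    return centers
```

The returned centers can then be passed as the initialization of a final run on the full dataset, e.g. `KMeans(n_clusters=k, init=pnn_smoothing_init(X, k), n_init=1).fit(X)`.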
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.