Scalable Density-based Clustering with Random Projections
- URL: http://arxiv.org/abs/2402.15679v1
- Date: Sat, 24 Feb 2024 01:45:51 GMT
- Title: Scalable Density-based Clustering with Random Projections
- Authors: Haochuan Xu, Ninh Pham
- Abstract summary: We present sDBSCAN, a scalable density-based clustering algorithm in high dimensions with cosine distance.
Empirically, sDBSCAN is significantly faster and provides higher accuracy than many other clustering algorithms on real-world million-point data sets.
- Score: 9.028773906859541
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present sDBSCAN, a scalable density-based clustering algorithm in high
dimensions with cosine distance. Utilizing the neighborhood-preserving property
of random projections, sDBSCAN can quickly identify core points and their
neighborhoods, the primary hurdle of density-based clustering. Theoretically,
sDBSCAN outputs a clustering structure similar to DBSCAN under mild conditions
with high probability. To further facilitate sDBSCAN, we present sOPTICS, a
scalable OPTICS for interactive exploration of the intrinsic clustering
structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and
Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is
significantly faster and provides higher accuracy than many other clustering
algorithms on real-world million-point data sets. On these data sets, sDBSCAN
and sOPTICS run in a few minutes, while scikit-learn's counterparts demand
several hours or cannot run at all due to memory constraints.
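The key idea, that random projections approximately preserve cosine neighborhoods, can be illustrated with a minimal sketch: points whose dot products with the same random direction are both extreme tend to be close in cosine distance, so the extremes of each projection yield cheap candidate neighbors from which core points can be estimated. This is an illustration of the principle only, not the paper's actual sDBSCAN implementation; all function names and parameter values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def candidate_neighbors(X, n_vectors=20, top_m=10):
    """Collect candidate cosine neighbors via random projections.

    For each random direction, the points with the largest and smallest
    dot products are grouped as mutual candidates: extreme projections
    onto the same direction suggest small angular (cosine) distance.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    R = rng.standard_normal((n_vectors, X.shape[1]))   # random directions
    proj = Xn @ R.T                                    # (n_points, n_vectors)
    cand = [set() for _ in range(len(X))]
    for j in range(n_vectors):
        order = np.argsort(proj[:, j])
        extremes = np.concatenate([order[:top_m], order[-top_m:]])
        for a in extremes:
            for b in extremes:
                if a != b:
                    cand[a].add(b)
    return cand

def core_points(X, cand, eps=0.3, min_pts=5):
    """Mark a point as core if at least min_pts of its candidates
    lie within cosine distance eps (cosine distance = 1 - cos angle)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cores = []
    for i, cs in enumerate(cand):
        idx = np.fromiter(cs, dtype=int)
        if idx.size == 0:
            continue
        d = 1.0 - Xn[idx] @ Xn[i]
        if (d <= eps).sum() >= min_pts:
            cores.append(i)
    return cores
```

Checking only candidate pairs, rather than all pairs, is what sidesteps the quadratic neighborhood search that the abstract identifies as the primary hurdle of density-based clustering; the extension to L2, L1, $\chi^2$, and Jensen-Shannon distances mentioned above works by first mapping points through random kernel features so that inner products approximate those kernels.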
Related papers
- LINSCAN -- A Linearity Based Clustering Algorithm [41.87020317965649]
DBSCAN and OPTICS are powerful algorithms for identifying clusters of points in domains where few assumptions can be made about the structure of the data.
We introduce a new algorithm, LINSCAN, designed to seek lineated clusters that are difficult to find and isolate with existing methods.
We demonstrate how LINSCAN can be applied to seismic data to identify active faults, including intersecting faults, and determine their orientation.
arXiv Detail & Related papers (2024-06-25T21:58:37Z) - FLASC: A Flare-Sensitive Clustering Algorithm [0.0]
We present FLASC, an algorithm that detects branches within clusters to identify subpopulations.
Two variants of the algorithm are presented, which trade computational cost for noise robustness.
We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs.
arXiv Detail & Related papers (2023-11-27T14:55:16Z) - GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity granular-balls with the minimum spanning tree (MST).
We construct coarse-grained granular-balls, then use granular-balls and the MST to implement clustering based on "large-scale priority".
Experimental results on several data sets demonstrate the effectiveness of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z) - Local Sample-weighted Multiple Kernel Clustering with Consensus Discriminative Graph [73.68184322526338]
Multiple kernel clustering (MKC) aims to achieve optimal information fusion from a set of base kernels.
This paper proposes a novel local sample-weighted multiple kernel clustering model.
Experimental results demonstrate that our LSWMKC achieves better local manifold representation and outperforms existing kernel- or graph-based clustering algorithms.
arXiv Detail & Related papers (2022-07-05T05:00:38Z) - An Improved Probability Propagation Algorithm for Density Peak Clustering Based on Natural Nearest Neighborhood [0.0]
Clustering by fast search and find of density peaks (DPC) has been proven to be a promising clustering approach.
This paper presents an improved probability propagation algorithm for density peak clustering based on the natural nearest neighborhood (DPC-PPNNN).
In experiments on several datasets, DPC-PPNNN is shown to outperform DPC, K-means and DBSCAN.
arXiv Detail & Related papers (2022-07-04T03:36:57Z) - A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
A density peaks clustering algorithm with sparse search and a K-d tree is developed to reduce the cost of neighbor search.
Experiments are carried out on data sets with different distribution characteristics, comparing against five other typical clustering algorithms.
arXiv Detail & Related papers (2022-03-02T09:29:40Z) - Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
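The contrast between the indicator-ball density and a smooth kernel density can be sketched as follows. A generic Gaussian kernel stands in here for illustration; the paper's adaptive diffusion kernel is more sophisticated, and the function names below are hypothetical.

```python
import numpy as np

def ball_density(X, eps):
    """Naive density: count of points inside the eps-ball around each point
    (the indicator function of a unit d-dimensional Euclidean ball, scaled)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return (D <= eps).sum(axis=1).astype(float)

def gaussian_density(X, h):
    """Smooth kernel density: each neighbor contributes exp(-d^2 / (2 h^2)),
    so the estimate varies continuously with local structure instead of
    jumping discontinuously at the ball boundary."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    return np.exp(-(D ** 2) / (2 * h ** 2)).sum(axis=1)
```

The indicator kernel treats every point inside the ball identically and ignores everything outside it, which is exactly the insensitivity to varying local distributional characteristics and smoothness that an adaptive kernel is designed to fix.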
arXiv Detail & Related papers (2021-10-11T09:00:33Z) - Fast Density Estimation for Density-based Clustering Methods [3.8972699157287702]
Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning.
The runtime of density-based algorithms is dominated by finding neighbors and calculating the density of each point, which is time-consuming.
This paper proposes a density-based clustering framework using fast principal component analysis, which can be applied to density-based methods to prune unnecessary distance calculations.
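The pruning idea can be sketched with a one-dimensional projection: for any unit vector v, |v · (a − b)| ≤ ‖a − b‖, so pairs that are already farther than eps apart along the first principal component cannot be eps-neighbors and the full distance never needs to be computed. This is a hypothetical illustration of projection-based pruning under that bound, not the paper's exact framework.

```python
import numpy as np

def neighbors_with_pca_pruning(X, eps):
    """Epsilon-range search that prunes full distance computations using a
    1-D PCA projection as a lower bound on pairwise distance."""
    Xc = X - X.mean(axis=0)
    # First principal component (unit right singular vector) via SVD.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    p = Xc @ Vt[0]                   # 1-D projections of all points
    order = np.argsort(p)            # scan points in projection order
    neighbors = [[] for _ in range(len(X))]
    for a in range(len(X)):
        i = order[a]
        for b in range(a + 1, len(X)):
            j = order[b]
            if p[j] - p[i] > eps:    # lower bound exceeds eps:
                break                # all later points are even farther
            if np.linalg.norm(X[i] - X[j]) <= eps:
                neighbors[i].append(j)
                neighbors[j].append(i)
    return neighbors
```

Sorting by the projection lets the inner loop terminate early, so the number of exact distance evaluations shrinks whenever the data spreads out along its principal direction.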
arXiv Detail & Related papers (2021-09-23T13:59:42Z) - Random Features for the Neural Tangent Kernel [57.132634274795066]
We propose an efficient feature map construction for the Neural Tangent Kernel (NTK) of a fully-connected ReLU network.
We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions achieving comparable error bounds, both in theory and in practice.
arXiv Detail & Related papers (2021-04-03T09:08:12Z) - (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose the use of a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
arXiv Detail & Related papers (2020-12-01T13:17:27Z) - Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this paper is the first work to efficiently cluster millions of data points among the subspace clustering methods.
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.