SDCOR: Scalable Density-based Clustering for Local Outlier Detection in
Massive-Scale Datasets
- URL: http://arxiv.org/abs/2006.07616v11
- Date: Mon, 26 Apr 2021 11:50:30 GMT
- Title: SDCOR: Scalable Density-based Clustering for Local Outlier Detection in
Massive-Scale Datasets
- Authors: Sayyed Ahmad Naghavi Nozad and Maryam Amir Haeri and Gianluigi Folino
- Abstract summary: This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets.
Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a batch-wise density-based clustering approach for local
outlier detection in massive-scale datasets. Unlike the well-known traditional
algorithms, which assume that all the data is memory-resident, our proposed
method is scalable and processes the input data chunk-by-chunk within the
confines of a limited memory buffer. A temporary clustering model is built in
the first phase and then gradually updated by analyzing consecutive memory
loads of points; at the end of this scalable clustering phase, the approximate
structure of the original clusters is obtained. Finally, in another scan of the
entire dataset and using a suitable criterion, an outlierness score, called
SDCOR (Scalable Density-based Clustering Outlierness Ratio), is assigned to
each object. Evaluations on real-life and synthetic datasets demonstrate that
the proposed method has low linear time complexity and is more effective and
efficient than the best-known conventional density-based methods, which must
load all data into memory, as well as some fast distance-based methods that
can operate on disk-resident data.
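To make the chunk-wise scheme concrete, below is a minimal Python sketch (assuming only NumPy). It is not the authors' implementation: a running-centroid model stands in for the paper's density-based cluster summaries, the chunk iterator emulates limited-memory loads of a disk-resident dataset, and the final distance ratio is a hypothetical proxy for the SDCOR score (the scoring pass is left un-chunked for brevity).

    import numpy as np

    rng = np.random.default_rng(0)

    def iter_chunks(X, chunk_size):
        # Emulates reading a disk-resident dataset into a limited memory buffer.
        for start in range(0, len(X), chunk_size):
            yield X[start:start + chunk_size]

    def scalable_clustering(X, k, chunk_size=1000):
        # Build a temporary model from the first memory load, then update it
        # with each consecutive chunk (running centroids stand in for the
        # paper's density-based cluster summaries).
        centroids, counts = None, None
        for chunk in iter_chunks(X, chunk_size):
            if centroids is None:
                seeds = rng.choice(len(chunk), size=k, replace=False)
                centroids = chunk[seeds].astype(float)
                counts = np.ones(k)
            nearest = np.linalg.norm(
                chunk[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
            for j in range(k):
                members = chunk[nearest == j]
                if len(members):
                    total = counts[j] + len(members)
                    centroids[j] = (counts[j] * centroids[j]
                                    + members.sum(axis=0)) / total
                    counts[j] = total
        return centroids

    def outlierness_scores(X, centroids):
        # Final scan: score each object by its distance to the nearest cluster
        # center divided by that cluster's mean member distance (a hypothetical
        # ratio criterion, not the published SDCOR formula).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest, dist = d.argmin(axis=1), d.min(axis=1)
        mean_radius = np.array([dist[nearest == j].mean()
                                for j in range(len(centroids))])
        return dist / mean_radius[nearest]

    X = np.vstack([rng.normal(0, 1, (5000, 2)),      # two dense clusters ...
                   rng.normal(8, 1, (5000, 2)),
                   rng.uniform(-10, 18, (50, 2))])   # ... plus scattered outliers
    scores = outlierness_scores(X, scalable_clustering(X, k=2))
    print("highest outlierness ratios:", np.round(np.sort(scores)[-3:], 2))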
Related papers
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs a local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique, which has a lower time-complexity upper bound and reduces the memory complexity from $O(n^2)$ to $O(kn)$ with $k \ll n$.
arXiv Detail & Related papers (2023-07-26T16:19:19Z)
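The $O(n^2)$-to-$O(kn)$ memory reduction mentioned in the entry above can be illustrated generically in Python/NumPy (this is not the paper's Re-Ranking method, only the storage pattern it exploits): compute distances one block of rows at a time and retain only each point's k nearest neighbors.

    import numpy as np

    def topk_neighbors(X, k, block=512):
        # Keeps only each point's k nearest neighbors instead of the full
        # n x n distance matrix: O(kn) memory rather than O(n^2), with k << n.
        n = len(X)
        nbr_idx = np.empty((n, k), dtype=np.int64)
        nbr_dst = np.empty((n, k))
        for s in range(0, n, block):               # one block of rows at a time
            d = np.linalg.norm(X[s:s + block, None, :] - X[None, :, :], axis=2)
            d[np.arange(len(d)), np.arange(s, s + len(d))] = np.inf  # mask self
            order = np.argsort(d, axis=1)[:, :k]
            nbr_idx[s:s + len(d)] = order
            nbr_dst[s:s + len(d)] = np.take_along_axis(d, order, axis=1)
        return nbr_idx, nbr_dst

    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 64))
    idx, dst = topk_neighbors(X, k=10)
    print(idx.shape, dst.shape)        # (2000, 10) arrays: O(kn), not O(n^2)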
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points via a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data is fuzzily iterated using granular-balls, and the membership degree of each point considers only the two granular-balls in which it is located.
The resulting fuzzy granular-ball set can employ more processing methods when facing different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
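One plausible reading of the two-granular-ball membership rule above, sketched in NumPy under the assumption that ball centers are already known (the function and its inverse-distance weighting are hypothetical, in the style of fuzzy c-means):

    import numpy as np

    def two_ball_membership(X, centers, m=2.0):
        # Fuzzy-c-means-style membership restricted to each point's two
        # nearest granular-ball centers; all other memberships are zero.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        two = np.argsort(d, axis=1)[:, :2]             # two nearest balls
        d2 = np.take_along_axis(d, two, axis=1)
        w = d2 ** (-2.0 / (m - 1.0))                   # inverse-distance weights
        u2 = w / w.sum(axis=1, keepdims=True)          # memberships sum to 1
        U = np.zeros((len(X), len(centers)))
        np.put_along_axis(U, two, u2, axis=1)
        return U

    rng = np.random.default_rng(2)
    X = rng.normal(size=(6, 2))
    centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
    print(two_ball_membership(X, centers).round(3))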
- SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers Detection Integrated [1.8444322599555096]
Clustering analysis is one of the critical tasks in machine learning.
Because clustering performance can be significantly eroded by outliers, many algorithms try to incorporate the process of outlier detection.
We propose SSDBCODI, a semi-supervised density-based clustering algorithm with integrated outlier detection.
arXiv Detail & Related papers (2022-08-10T21:06:38Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
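The subset pseudo-labeling idea above can be sketched generically (assuming NumPy and scikit-learn; plain KMeans stands in for MCL's meta-clustering, and the sampling fraction is an arbitrary placeholder):

    import numpy as np
    from sklearn.cluster import KMeans

    def pseudo_label_subset(features, frac=0.2, n_clusters=10, seed=0):
        # Cluster only a sampled subset of the unlabeled data and pseudo-label
        # it, so first-phase training never touches the full dataset.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(features), size=int(frac * len(features)),
                         replace=False)
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        labels = km.fit_predict(features[idx])
        return idx, labels       # train phase one on (features[idx], labels)

    rng = np.random.default_rng(0)
    feats = rng.normal(size=(10000, 128))    # stand-in for re-ID embeddings
    idx, pl = pseudo_label_subset(feats)
    print(len(idx), "of", len(feats), "samples pseudo-labeled")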
- Density-Based Clustering with Kernel Diffusion [59.4179549482505]
A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in density-based clustering algorithms.
We propose a new kernel diffusion density function, which is adaptive to data of varying local distributional characteristics and smoothness.
arXiv Detail & Related papers (2021-10-11T09:00:33Z)
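The contrast drawn in the entry above can be sketched in NumPy: the naive indicator-ball density versus a smooth Gaussian kernel whose bandwidth adapts to each point's local neighbor distance (an illustration of adaptivity only, not the paper's diffusion kernel):

    import numpy as np

    def ball_density(X, radius):
        # Naive density: number of points inside a fixed-radius Euclidean
        # ball around each point (the indicator kernel criticized above).
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        return (d < radius).sum(axis=1) - 1        # exclude the point itself

    def adaptive_kernel_density(X, k=10):
        # Smooth alternative: Gaussian kernel whose bandwidth is each point's
        # k-th-neighbor distance, so it adapts to the local scale of the data.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        h = np.sort(d, axis=1)[:, k]               # local bandwidth per point
        return np.exp(-(d / h[:, None]) ** 2).sum(axis=1)

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0, 0.3, (100, 2)),   # tight cluster
                   rng.normal(4, 1.5, (100, 2))])  # diffuse cluster
    print("ball:  ", ball_density(X, radius=0.5)[:3])
    print("kernel:", adaptive_kernel_density(X)[:3].round(2))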
- Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data.
In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data.
Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
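A toy version of the plot-then-segment idea above, assuming NumPy and SciPy (connected-component labeling stands in for the paper's actual segmentation model):

    import numpy as np
    from scipy import ndimage

    def visual_cluster(points, grid=64):
        # Rasterize 2-D points into an occupancy image, dilate to bridge small
        # gaps, then treat connected blobs of occupied pixels as clusters.
        lo, hi = points.min(axis=0), points.max(axis=0)
        pix = ((points - lo) / (hi - lo + 1e-12) * (grid - 1)).astype(int)
        img = np.zeros((grid, grid), dtype=bool)
        img[pix[:, 0], pix[:, 1]] = True
        img = ndimage.binary_dilation(img, iterations=2)
        blobs, n_clusters = ndimage.label(img)     # image segmentation step
        return blobs[pix[:, 0], pix[:, 1]], n_clusters

    rng = np.random.default_rng(4)
    pts = np.vstack([rng.normal(0, 0.5, (300, 2)),
                     rng.normal(5, 0.5, (300, 2))])
    labels, n = visual_cluster(pts)
    print("clusters found:", n)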
- Fast Density Estimation for Density-based Clustering Methods [3.8972699157287702]
Density-based clustering algorithms are widely used for discovering clusters in pattern recognition and machine learning.
The performance of density-based algorithms is heavily dominated by finding neighbors and calculating the density of each point, which is time-consuming.
This paper proposes a density-based clustering framework that uses fast principal component analysis, which can be applied to density-based methods to prune unnecessary distance calculations.
arXiv Detail & Related papers (2021-09-23T13:59:42Z)
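The pruning idea above rests on the fact that for a unit-norm projection vector $v$, $|v \cdot x - v \cdot y| \le \|x - y\|$, so the gap along the first principal axis lower-bounds the true distance. A generic NumPy sketch of eps-neighbor queries pruned this way (not the paper's full framework):

    import numpy as np

    def eps_neighbors_pca(X, eps):
        # Range queries pruned by a 1-D PCA projection: if the projected gap
        # already exceeds eps, the full distance cannot be <= eps, so the
        # exact computation is skipped.
        Xc = X - X.mean(axis=0)
        v = np.linalg.svd(Xc, full_matrices=False)[2][0]  # first principal axis
        p = Xc @ v
        order = np.argsort(p)
        neighbors = [[] for _ in range(len(X))]
        for a, i in enumerate(order):
            for j in order[a + 1:]:
                if p[j] - p[i] > eps:      # everything later is farther in 1-D
                    break
                if np.linalg.norm(X[i] - X[j]) <= eps:  # exact check survivors
                    neighbors[i].append(j)
                    neighbors[j].append(i)
        return neighbors

    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 8))
    nbrs = eps_neighbors_pca(X, eps=1.5)
    print("average neighbors per point:", sum(map(len, nbrs)) / len(X))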
- Very Compact Clusters with Structural Regularization via Similarity and Connectivity [3.779514860341336]
We propose an end-to-end deep clustering algorithm, Very Compact Clusters (VCC), for general datasets.
Our proposed approach achieves better clustering performance than most state-of-the-art clustering methods.
arXiv Detail & Related papers (2021-06-09T23:22:03Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.