SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers
Detection Integrated
- URL: http://arxiv.org/abs/2208.05561v1
- Date: Wed, 10 Aug 2022 21:06:38 GMT
- Title: SSDBCODI: Semi-Supervised Density-Based Clustering with Outliers
Detection Integrated
- Authors: Jiahao Deng and Eli T. Brown
- Abstract summary: Clustering analysis is one of the critical tasks in machine learning.
Because the performance of clustering can be significantly eroded by outliers, some algorithms try to incorporate outlier detection into the clustering process.
We propose SSDBCODI, a semi-supervised density-based clustering algorithm with integrated outlier detection.
- Score: 1.8444322599555096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering analysis is one of the critical tasks in machine learning.
Traditionally, clustering has been an independent task, separate from outlier
detection. Because the performance of clustering can be significantly eroded
by outliers, a small number of algorithms try to incorporate outlier detection
into the clustering process. However, most of those algorithms build on
unsupervised partition-based methods such as k-means and, by their nature,
often fail to deal with clusters of complex, non-convex shapes. To tackle this
challenge, we propose SSDBCODI, a semi-supervised density-based algorithm. SSDBCODI combines
the advantage of density-based algorithms, which are capable of dealing with
clusters of complex shapes, with a semi-supervised element, which offers the
flexibility to adjust clustering results based on a few user labels. We
also merge an outlier detection component with the clustering process.
Potential outliers are detected based on three scores generated during the
process: (1) reachability-score, which measures how density-reachable a point
is from a labeled normal object, (2) local-density-score, which measures the
neighboring density of data objects, and (3) similarity-score, which measures
the closeness of a point to its nearest labeled outliers. In the next step,
instance weights are generated for each data instance from those three scores
and then used to train a classifier for further clustering and outlier
detection. For our evaluation, we have run our algorithm against several
state-of-the-art approaches on multiple datasets and report the outlier
detection results separately from the clustering results. Our results indicate
that our algorithm can achieve superior performance with a small percentage of
labels.
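To make the scoring-and-weighting step concrete, below is a minimal Python sketch of the idea. The score formulas here (inverse mean k-NN distance as a density proxy, inverse distance to the nearest labeled normal or outlier point as stand-ins for reachability and similarity) and all function names are illustrative assumptions, not the paper's exact definitions; in particular, the paper uses density-reachability rather than plain Euclidean distance for score (1).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors


def three_scores(X, normal_idx, outlier_idx, k=10):
    """Illustrative versions of the three scores (not the paper's exact formulas)."""
    # (2) local-density-score: inverse mean distance to the k nearest
    # neighbors, a common proxy for neighborhood density.
    dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
    local_density = 1.0 / (dists.mean(axis=1) + 1e-12)

    # (1) reachability-score: proxied here by the distance to the nearest
    # labeled normal point (the paper uses density-reachability instead).
    d_normal = np.linalg.norm(X[:, None] - X[normal_idx][None], axis=-1).min(axis=1)
    reachability = 1.0 / (1.0 + d_normal)

    # (3) similarity-score: closeness to the nearest labeled outlier.
    d_outlier = np.linalg.norm(X[:, None] - X[outlier_idx][None], axis=-1).min(axis=1)
    similarity = 1.0 / (1.0 + d_outlier)
    return reachability, local_density, similarity


def fit_weighted_classifier(X, normal_idx, outlier_idx, k=10):
    """Turn the three scores into instance weights and train a classifier."""
    r, d, s = three_scores(X, normal_idx, outlier_idx, k)

    # Rescale each score to [0, 1] so the three are comparable.
    def rescale(v):
        return (v - v.min()) / (v.max() - v.min() + 1e-12)

    r, d, s = rescale(r), rescale(d), rescale(s)

    # Outlier evidence: close to labeled outliers, low local density,
    # poorly reachable from labeled normal points.
    outlier_score = (s + (1.0 - d) + (1.0 - r)) / 3.0
    pseudo_labels = (outlier_score > 0.5).astype(int)  # 1 = outlier

    # Instance weights: confidence in each pseudo-label, so ambiguous
    # points contribute little to the classifier.
    weights = np.abs(outlier_score - 0.5) * 2.0
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, pseudo_labels, sample_weight=weights)
    return clf, outlier_score


# Example usage with synthetic data and hypothetical label indices:
# X = np.random.RandomState(0).randn(500, 2)
# clf, scores = fit_weighted_classifier(X, normal_idx=[0, 1, 2], outlier_idx=[3])
```

In this sketch, points whose combined evidence sits near the 0.5 decision boundary receive low weight, so the classifier is trained mostly on instances the three scores agree on, mirroring the paper's idea of letting confident points drive the final clustering and outlier detection.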
Related papers
- From Large to Small Datasets: Size Generalization for Clustering
Algorithm Selection [12.993073967843292]
We study clustering algorithm selection in a semi-supervised setting, with an unknown ground-truth clustering.
We introduce a notion of size generalization for clustering algorithm accuracy.
We use a subsample of as little as 5% of the data to identify which algorithm is best on the full dataset.
arXiv Detail & Related papers (2024-02-22T06:53:35Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- FLASC: A Flare-Sensitive Clustering Algorithm [0.0]
We present FLASC, an algorithm that detects branches within clusters to identify subpopulations.
Two variants of the algorithm are presented, which trade computational cost for noise robustness.
We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs.
arXiv Detail & Related papers (2023-11-27T14:55:16Z)
- Linear time Evidence Accumulation Clustering with KMeans [0.0]
This work describes a trick that mimics the behavior of average linkage clustering.
We found a way to compute the density of a partitioning efficiently, reducing the cost from quadratic to linear complexity.
The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low.
arXiv Detail & Related papers (2023-11-15T14:12:59Z)
- A Computational Theory and Semi-Supervised Algorithm for Clustering [0.0]
A semi-supervised clustering algorithm is presented.
The kernel of the clustering method is Mohammad's anomaly detection algorithm.
Results are presented on synthetic and real-world data sets.
arXiv Detail & Related papers (2023-06-12T09:15:58Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised
Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for this is the lack of a clustering-friendly property in the embedding space.
arXiv Detail & Related papers (2022-09-27T19:04:36Z) - Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes or DPP for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, standard uniform random initialization fails both to ensure diversity and to obtain good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z) - Clustering of Big Data with Mixed Features [3.3504365823045044]
We develop a new clustering algorithm for large data of mixed type.
The algorithm is capable of detecting outliers and clusters of relatively lower density values.
We present experimental results to verify that our algorithm works well in practice.
arXiv Detail & Related papers (2020-11-11T19:54:38Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)