Automated Clustering of High-dimensional Data with a Feature Weighted
Mean Shift Algorithm
- URL: http://arxiv.org/abs/2012.10929v1
- Date: Sun, 20 Dec 2020 14:00:40 GMT
- Title: Automated Clustering of High-dimensional Data with a Feature Weighted
Mean Shift Algorithm
- Authors: Saptarshi Chakraborty, Debolina Paul and Swagatam Das
- Abstract summary: Mean shift is a simple iterative procedure that shifts data points towards the mode, which denotes the highest density of data points in the region.
We propose a simple yet elegant feature-weighted variant of mean shift to efficiently learn the feature importance.
The resulting algorithm not only outperforms the conventional mean shift clustering procedure but also preserves its computational simplicity.
- Score: 16.0817847880416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mean shift is a simple iterative procedure that gradually shifts data
points towards the mode which denotes the highest density of data points in the
region. Mean shift algorithms have been effectively used for data denoising,
mode seeking, and finding the number of clusters in a dataset in an automated
fashion. However, the merits of mean shift quickly fade away as the data
dimensions increase and only a handful of features contain useful information
about the cluster structure of the data. We propose a simple yet elegant
feature-weighted variant of mean shift to efficiently learn the feature
importance and thus extend the merits of mean shift to high-dimensional
data. The resulting algorithm not only outperforms the conventional mean shift
clustering procedure but also preserves its computational simplicity. In
addition, the proposed method comes with rigorous theoretical convergence
guarantees and a convergence rate of at least a cubic order. The efficacy of
our proposal is thoroughly assessed through experimental comparison against
baseline and state-of-the-art clustering methods on synthetic as well as
real-world datasets.
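The update the abstract describes can be sketched in a few lines. The following is a minimal illustration of standard (unweighted) mean shift with a Gaussian kernel; the paper's contribution additionally learns per-feature weights, which this sketch deliberately omits. Function names and parameters are illustrative, not from the paper.

```python
import numpy as np

def mean_shift(X, bandwidth=1.0, n_iter=100, tol=1e-6):
    """Iteratively shift every point toward the kernel-weighted local mean,
    so each point climbs toward the nearest density mode."""
    points = X.astype(float).copy()
    for _ in range(n_iter):
        shifted = np.empty_like(points)
        for i, p in enumerate(points):
            # Gaussian kernel weight of each original data point w.r.t. p
            w = np.exp(-((X - p) ** 2).sum(axis=1) / (2 * bandwidth ** 2))
            shifted[i] = w @ X / w.sum()  # kernel-weighted mean
        if np.abs(shifted - points).max() < tol:  # all points have converged
            points = shifted
            break
        points = shifted
    return points

# Points from two well-separated blobs collapse onto two distinct modes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
modes = mean_shift(X, bandwidth=0.5)
```

Clusters then fall out automatically: points whose converged positions coincide belong to the same cluster, which is how mean shift finds the number of clusters without specifying it in advance. In high dimensions, noise features dilute the Euclidean distances inside the kernel, which is the failure mode the feature-weighted variant addresses.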
Related papers
- Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size in each iteration without violating neighborhood relationships.
A second strategy leverages a novel Re-Ranking technique with a lower worst-case time complexity, reducing the memory complexity from O(n²) to O(kn) with k ≪ n.
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Influence of Swarm Intelligence in Data Clustering Mechanisms [0.0]
Nature-inspired, swarm-based algorithms are used for data clustering to cope with larger datasets that suffer from missing and inconsistent data.
This paper reviews the performance of these new approaches and compares which is best suited to particular problem situations.
arXiv Detail & Related papers (2023-05-07T08:40:50Z) - ExClus: Explainable Clustering on Low-dimensional Data Representations [9.496898312608307]
Dimensionality reduction and clustering techniques are frequently used to analyze complex data sets, but their results are often not easy to interpret.
We consider how to support users in interpreting apparent cluster structure on scatter plots where the axes are not directly interpretable.
We propose a new method to compute an interpretable clustering automatically, where the explanation is in the original high-dimensional space and the clustering is coherent in the low-dimensional projection.
arXiv Detail & Related papers (2021-11-04T21:24:01Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on observed data alone and maintain high accuracy even when the missing-data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature
Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method that combines reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z) - Too Much Information Kills Information: A Clustering Perspective [6.375668163098171]
We propose a simple but novel approach for variance-based k-clustering tasks, a family that includes the widely known k-means clustering.
The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only.
With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability.
arXiv Detail & Related papers (2020-09-16T01:54:26Z) - SDCOR: Scalable Density-based Clustering for Local Outlier Detection in
Massive-Scale Datasets [0.0]
This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets.
Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity.
arXiv Detail & Related papers (2020-06-13T11:07:37Z) - New advances in enumerative biclustering algorithms with online
partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps the attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z) - Autoencoder-based time series clustering with energy applications [0.0]
Time series clustering is a challenging task due to the specific nature of the data.
In this paper we investigate the combination of a convolutional autoencoder and a k-medoids algorithm to perform time series clustering.
arXiv Detail & Related papers (2020-02-10T10:04:29Z)
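The clustering stage of the last entry can be illustrated in isolation. The sketch below shows only a PAM-style k-medoids loop on plain vectors standing in for autoencoder latent embeddings; the convolutional autoencoder itself is not reproduced, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def k_medoids(X, k, n_iter=20, seed=0):
    """Alternate two steps: assign each point to its nearest medoid, then
    re-pick each cluster's medoid as the member minimizing the total
    within-cluster distance. Medoids are actual data points, which is why
    k-medoids suits non-Euclidean or precomputed distances."""
    rng = np.random.default_rng(seed)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = np.array([
            np.flatnonzero(labels == j)[
                np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=0))
            ]
            for j in range(k)
        ])
        if set(new) == set(medoids):  # medoids stable: converged
            break
        medoids = new
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Two well-separated blobs of latent vectors are split cleanly
rng = np.random.default_rng(1)
Z = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(6, 0.2, (20, 2))])
medoids, labels = k_medoids(Z, k=2)
```

In the paper's pipeline the rows of `Z` would be the autoencoder's latent representations of the time series; using medoids rather than means keeps each cluster center decodable back to an actual series.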
This list is automatically generated from the titles and abstracts of the papers in this site.