Scalable Initialization Methods for Large-Scale Clustering
- URL: http://arxiv.org/abs/2007.11937v1
- Date: Thu, 23 Jul 2020 11:29:53 GMT
- Title: Scalable Initialization Methods for Large-Scale Clustering
- Authors: Joonas Hämäläinen, Tommi Kärkkäinen, Tuomo Rossi
- Abstract summary: Two new initialization methods for K-means clustering are proposed.
The proposed methods are scalable and can be run in parallel.
Experiments show that the proposed methods compare favorably to the state-of-the-art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, two new initialization methods for K-means clustering are
proposed. Both proposals are based on applying a divide-and-conquer approach
to a K-means||-type initialization strategy. The second proposal also
utilizes multiple lower-dimensional subspaces produced by the random projection
method for the initialization. The proposed methods are scalable and can be run
in parallel, which makes them suitable for initializing large-scale problems. In
the experiments, the proposed methods are compared to the K-means++ and
K-means|| methods on an extensive set of reference and synthetic large-scale
datasets. For the latter, a novel high-dimensional clustering data generation
algorithm is given. The experiments show that the proposed methods compare
favorably to the state-of-the-art. We also observe that the currently most
popular K-means++ initialization behaves similarly to random initialization
in very high-dimensional cases.
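The abstract names the building blocks (K-means|| oversampling, a divide-and-conquer split) but not the exact procedure. As a rough, hypothetical Python sketch only — the chunking scheme, the weighted k-means++ reduction, and all names below are assumptions, not the authors' algorithm — the combination could look like this:

    import numpy as np

    def kmeans_parallel_init(X, k, rounds=5, oversample=None, rng=None):
        # K-means||-style oversampling (Bahmani et al.): for a few rounds,
        # sample points with probability proportional to their squared
        # distance from the current candidate set. Simplified sketch.
        rng = np.random.default_rng(rng)
        ell = oversample or 2 * k                    # oversampling factor
        C = X[rng.integers(len(X))][None, :]         # one uniform seed
        for _ in range(rounds):
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)
            p = np.minimum(1.0, ell * d2 / d2.sum())
            C = np.vstack([C, X[rng.random(len(X)) < p]])
        return C

    def divide_and_conquer_init(X, k, parts=4, rng=None):
        # Hypothetical divide-and-conquer wrapper: run K-means|| on each
        # chunk independently (parallelizable), pool the candidates,
        # weight them by the number of points they attract, and reduce
        # them to k centers with weighted k-means++.
        rng = np.random.default_rng(rng)
        chunks = np.array_split(rng.permutation(X), parts)
        cands = np.vstack([kmeans_parallel_init(c, k, rng=rng) for c in chunks])
        w = np.bincount(((X[:, None, :] - cands[None, :, :]) ** 2)
                        .sum(-1).argmin(1), minlength=len(cands)).astype(float)
        centers = [cands[rng.choice(len(cands), p=w / w.sum())]]
        for _ in range(k - 1):
            d2 = ((cands[:, None, :] - np.array(centers)[None]) ** 2).sum(-1).min(1)
            p = w * d2
            centers.append(cands[rng.choice(len(cands), p=p / p.sum())])
        return np.array(centers)

The second proposal would additionally repeat this inside random lower-dimensional projections of the data; that step is omitted from the sketch.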
Related papers
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
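As a hypothetical reading of the distance-construction step only (the tensor Schatten p-norm regularization is not shown, and the cutoff and order parameters are assumptions), a Butterworth-style affinity in Python might be:

    import numpy as np

    def butterworth_affinity(X, cutoff=1.0, order=4):
        # Pass pairwise Euclidean distances through a Butterworth low-pass
        # response: small distances map to ~1, large ones decay smoothly
        # toward 0, giving a cluster-friendly similarity matrix.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return 1.0 / np.sqrt(1.0 + (d / cutoff) ** (2 * order))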
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- CKmeans and FCKmeans: Two deterministic initialization procedures for Kmeans algorithm using a modified crowding distance [0.0]
Two novel deterministic procedures for K-means clustering are presented.
The procedures, named CKmeans and FCKmeans, use more crowded points as initial centroids.
Experimental studies on multiple datasets demonstrate that the proposed approach outperforms Kmeans and Kmeans++ in terms of clustering accuracy.
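The exact crowding-distance modification is not given in the summary; one plausible sketch of "use more crowded points as initial centroids" (the radius rule below is an assumption) is:

    import numpy as np

    def crowded_init(X, k, radius=None):
        # Deterministic seeding by crowding: rank points by how many
        # neighbours fall within `radius`, then take the most crowded
        # points while skipping any that lie within `radius` of an
        # already chosen centroid, so seeds do not collapse together.
        d = np.linalg.norm(X[:, None] - X[None], axis=-1)
        if radius is None:
            radius = np.median(d)                # crude default scale
        crowding = (d < radius).sum(1)           # neighbour counts
        centers = []
        for i in np.argsort(-crowding):          # most crowded first
            if all(d[i, j] >= radius for j in centers):
                centers.append(i)
            if len(centers) == k:
                break
        return X[centers]                        # may be < k if radius too large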
arXiv Detail & Related papers (2023-04-19T21:46:02Z)
- An enhanced method of initial cluster center selection for K-means algorithm [0.0]
We propose a novel approach to improve initial cluster selection for K-means algorithm.
The convex hull algorithm facilitates the computing of the first two centroids, and the remaining ones are selected according to their distance from the previously selected centers.
We obtained only 7.33%, 7.90%, and 0% clustering error on the Iris, Letter, and Ruspini datasets, respectively.
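A minimal sketch of the described scheme, assuming the first two centroids are the two most distant hull vertices and later ones follow a farthest-point rule (both details are assumptions):

    import numpy as np
    from scipy.spatial import ConvexHull

    def convex_hull_init(X, k):
        # First two centroids: the most distant pair of convex hull
        # vertices. Remaining centroids: the point farthest (by minimum
        # distance) from all centroids chosen so far.
        hull = X[ConvexHull(X).vertices]
        d = np.linalg.norm(hull[:, None] - hull[None], axis=-1)
        i, j = np.unravel_index(d.argmax(), d.shape)
        centers = [hull[i], hull[j]]
        while len(centers) < k:
            dist = np.linalg.norm(X[:, None] - np.array(centers)[None], axis=-1)
            centers.append(X[dist.min(1).argmax()])
        return np.array(centers)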
arXiv Detail & Related papers (2022-10-18T00:58:50Z)
- $k$-Median Clustering via Metric Embedding: Towards Better Initialization with Differential Privacy [31.963659189956367]
We develop a new HST (hierarchically well-separated tree) scheme for the $k$-median problem in general metric spaces.
From the tree, we propose a novel and efficient search algorithm for good initial centers.
Our approach can also be extended to the $k$-means problem.
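The HST construction and search are beyond a few lines; as a loose illustration only of "build a tree, then read good initial centers off it" (everything below is an assumption, not the paper's algorithm):

    import numpy as np

    def tree_seeding_sketch(X, k, rng=None):
        # Recursively split the largest node at the median of a random
        # axis, then return the means of the k heaviest leaves as seeds.
        rng = np.random.default_rng(rng)
        leaves = [X]
        while len(leaves) < 4 * k:
            i = max(range(len(leaves)), key=lambda t: len(leaves[t]))
            node = leaves[i]
            dim = rng.integers(node.shape[1])
            med = np.median(node[:, dim])
            left, right = node[node[:, dim] < med], node[node[:, dim] >= med]
            if len(left) == 0 or len(right) == 0:   # unsplittable on this axis
                break
            leaves[i:i + 1] = [left, right]          # replace node by children
        heavy = sorted(leaves, key=len, reverse=True)[:k]
        return np.array([leaf.mean(0) for leaf in heavy])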
arXiv Detail & Related papers (2022-06-26T14:58:36Z)
- Gradient Based Clustering [72.15857783681658]
We propose a general approach for distance-based clustering, using the gradient of the cost function that measures clustering quality.
The approach is an iterative two-step procedure (alternating between cluster assignment and cluster center updates) and is applicable to a wide range of cost functions.
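For the quadratic (K-means) cost, a minimal sketch of the alternating procedure (the learning rate and stopping rule are assumptions) is:

    import numpy as np

    def gradient_clustering(X, k, lr=0.5, steps=100, rng=None):
        # Two-step iteration: (1) assign points to nearest centers;
        # (2) move each center along the negative gradient of its
        # cluster's quadratic cost, i.e. toward the cluster mean
        # (lr=1 recovers the classical Lloyd update).
        rng = np.random.default_rng(rng)
        X = np.asarray(X, dtype=float)
        C = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(steps):
            labels = np.linalg.norm(X[:, None] - C[None], axis=-1).argmin(1)
            for j in range(k):
                pts = X[labels == j]
                if len(pts):
                    C[j] -= lr * (C[j] - pts.mean(0))
        return C, labels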
arXiv Detail & Related papers (2022-02-01T19:31:15Z)
- K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
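A hedged sketch of a k-splits-style loop — the variance threshold stopping rule here is an assumption standing in for the paper's actual split criterion:

    import numpy as np
    from sklearn.cluster import KMeans

    def k_splits_like(X, max_k=20, threshold=1.0):
        # Repeatedly bisect (2-means) the cluster with the largest total
        # within-cluster variance until every cluster falls below the
        # threshold, so the number of clusters is found automatically.
        clusters = [np.asarray(X, dtype=float)]
        while len(clusters) < max_k:
            spread = [c.var(axis=0).sum() for c in clusters]
            worst = int(np.argmax(spread))
            if spread[worst] < threshold or len(clusters[worst]) < 2:
                break
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(clusters[worst])
            c = clusters.pop(worst)
            clusters += [c[labels == 0], c[labels == 1]]
        return clusters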
arXiv Detail & Related papers (2021-10-09T23:02:57Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPP) for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, uniformly random center selection fails both to ensure diversity and to obtain a good coverage of all data facets.
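Exact DPP sampling is more involved; a greedy stand-in that conveys the diversity mechanism (the RBF kernel and greedy rule are assumptions, not the paper's sampler) is:

    import numpy as np

    def greedy_diverse_seeds(X, k, bandwidth=1.0):
        # Grow the seed set by always adding the point least similar
        # (under an RBF kernel) to the seeds chosen so far, mimicking
        # the repulsive behaviour that makes DPP subsets diverse.
        d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
        K = np.exp(-d2 / (2 * bandwidth ** 2))
        chosen = [int(K.sum(1).argmax())]        # start near the data bulk
        while len(chosen) < k:
            sim = K[:, chosen].max(1)            # similarity to seed set
            sim[chosen] = np.inf                 # never re-pick a seed
            chosen.append(int(sim.argmin()))
        return X[chosen]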
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Learnable Subspace Clustering [76.2352740039615]
We develop a learnable subspace clustering paradigm to efficiently solve the large-scale subspace clustering problem.
The key idea is to learn a parametric function to partition the high-dimensional subspaces into their underlying low-dimensional subspaces.
To the best of our knowledge, this is the first subspace clustering method to efficiently cluster millions of data points.
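A toy PyTorch sketch of the "learn a parametric partition function" idea (the architecture and names are assumptions; training against a subspace-clustering objective is omitted):

    import torch
    import torch.nn as nn

    class SubspaceAssigner(nn.Module):
        # Maps each point to soft subspace-membership scores, so that
        # assigning a new point is one forward pass instead of solving
        # a large self-expressive optimization over all n points.
        def __init__(self, dim, n_subspaces, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_subspaces))

        def forward(self, x):
            return torch.softmax(self.net(x), dim=-1)

    # usage: memberships = SubspaceAssigner(100, 5)(torch.randn(8, 100))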
arXiv Detail & Related papers (2020-04-09T12:53:28Z)
- Optimal Randomized First-Order Methods for Least-Squares Problems [56.05635751529922]
This class of algorithms encompasses several randomized methods that are among the fastest solvers for least-squares problems.
We focus on two classical embeddings, namely, Gaussian projections and subsampled Hadamard transforms.
Our resulting algorithm yields the best complexity known for solving least-squares problems with no condition number dependence.
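For a concrete feel of the Gaussian embedding (the simpler of the two named embeddings; the paper's iterative, preconditioned methods are not reproduced here), a basic sketch-and-solve example:

    import numpy as np

    def sketch_and_solve(A, b, m=None, rng=None):
        # Solve min ||SAx - Sb|| with a Gaussian sketch S instead of the
        # full n x d problem; m controls the accuracy/cost trade-off.
        rng = np.random.default_rng(rng)
        n, d = A.shape
        m = m or 4 * d                       # sketch size, an assumption
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
        return x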
arXiv Detail & Related papers (2020-02-21T17:45:32Z)
- A novel initialisation based on hospital-resident assignment for the k-modes algorithm [0.0]
This paper presents a new way of selecting an initial solution for the k-modes algorithm.
It allows for a notion of mathematical fairness and a leverage of the data that the common initialisations from the literature do not offer.
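The actual hospital-resident matching uses deferred acceptance; the greedy, capacity-constrained sketch below only gestures at the idea (all details are assumptions) for integer-coded categorical data:

    import numpy as np

    def matching_style_init(X, k, rng=None):
        # k candidate modes act as 'hospitals' with equal capacity and
        # points act as 'residents' preferring nearby modes (Hamming
        # distance); each point is admitted greedily to its best mode
        # with room left, then each mode is recomputed as the
        # per-attribute majority of its admitted points.
        rng = np.random.default_rng(rng)
        modes = X[rng.choice(len(X), size=k, replace=False)]
        cap = int(np.ceil(len(X) / k))
        d = (X[:, None] != modes[None]).sum(-1)
        assign = np.full(len(X), -1)
        load = np.zeros(k, dtype=int)
        for i in np.argsort(d.min(1)):           # most decided points first
            for j in np.argsort(d[i]):
                if load[j] < cap:
                    assign[i], load[j] = j, load[j] + 1
                    break
        new_modes = []
        for j in range(k):
            grp = X[assign == j]
            new_modes.append([np.bincount(c).argmax() for c in grp.T]
                             if len(grp) else modes[j])
        return np.array(new_modes)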
arXiv Detail & Related papers (2020-02-07T10:20:49Z)
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
On a set of 16 data tables generated by a quasi-Monte Carlo experiment, one of the aggregation criteria is compared, under the L1 dissimilarity, against hierarchical clustering and a k-means variant: partitioning around medoids (PAM).
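For reference, a compact sketch of the k-means variant named above — PAM-style k-medoids under the L1 dissimilarity, which equals the Hamming distance on 0/1 data (the per-cluster medoid update is a simplification of PAM's full build/swap phases):

    import numpy as np

    def pam_binary(X, k, iters=20, rng=None):
        # Alternate between assigning points to the nearest medoid and
        # re-picking each cluster's medoid as its point with the
        # smallest total L1 distance to the other cluster members.
        rng = np.random.default_rng(rng)
        D = np.abs(X[:, None] - X[None]).sum(-1)   # L1 == Hamming on 0/1
        medoids = rng.choice(len(X), size=k, replace=False)
        for _ in range(iters):
            labels = D[:, medoids].argmin(1)
            new = medoids.copy()
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):                   # best medoid in cluster
                    new[j] = members[D[np.ix_(members, members)].sum(1).argmin()]
            if np.array_equal(new, medoids):
                break
            medoids = new
        return X[medoids], labels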
arXiv Detail & Related papers (2020-01-06T23:33:31Z)