Efficient Sparse Spherical k-Means for Document Clustering
- URL: http://arxiv.org/abs/2108.00895v1
- Date: Fri, 30 Jul 2021 12:02:33 GMT
- Title: Efficient Sparse Spherical k-Means for Document Clustering
- Authors: Johannes Knittel, Steffen Koch, Thomas Ertl
- Abstract summary: We propose an efficient indexing structure to improve the scalability of Spherical k-Means with respect to k.
Our approach exploits the sparsity of the input vectors and the convergence behavior of k-Means to reduce the number of comparisons on each iteration significantly.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spherical k-Means is frequently used to cluster document collections because
it performs reasonably well in many settings and is computationally efficient.
However, the time complexity increases linearly with the number of clusters k,
which limits the suitability of the algorithm for larger values of k depending
on the size of the collection. Optimizations targeted at the Euclidean k-Means
algorithm largely do not apply because the cosine distance is not a metric. We
therefore propose an efficient indexing structure to improve the scalability of
Spherical k-Means with respect to k. Our approach exploits the sparsity of the
input vectors and the convergence behavior of k-Means to reduce the number of
comparisons on each iteration significantly.
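Since the abstract describes the baseline being accelerated, a minimal sketch of plain Spherical k-Means on sparse vectors may help ground it. This is not the paper's indexing structure, only the naive O(n·k) baseline it improves on; the dictionary-based sparse vectors and all helper names are illustrative assumptions:

```python
import math
import random

def normalize(v):
    """Scale a sparse vector (dict: term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in v.values()))
    return {t: w / norm for t, w in v.items()} if norm > 0 else v

def dot(a, b):
    """Dot product of two sparse vectors; iterate over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def spherical_kmeans(docs, k, iters=20, seed=0):
    """Naive Spherical k-Means: n*k similarity computations per iteration,
    which is the per-iteration cost the paper's index reduces."""
    rng = random.Random(seed)
    docs = [normalize(d) for d in docs]
    centroids = [dict(d) for d in rng.sample(docs, k)]
    for _ in range(iters):
        # Assign each document to the most similar (max cosine) centroid;
        # on unit vectors, cosine similarity is just the dot product.
        assign = [max(range(k), key=lambda j: dot(d, centroids[j]))
                  for d in docs]
        # Recompute each centroid as the normalized sum of its members.
        sums = [dict() for _ in range(k)]
        for d, j in zip(docs, assign):
            for t, w in d.items():
                sums[j][t] = sums[j].get(t, 0.0) + w
        centroids = [normalize(s) if s else centroids[j]
                     for j, s in enumerate(sums)]
    return assign, centroids

# The abstract's point that cosine distance (1 - cos) is not a metric can be
# checked directly: with x=(1,0), y=(1,1)/sqrt(2), z=(0,1),
# d(x,z) = 1 > d(x,y) + d(y,z) ~= 0.586, so the triangle inequality fails
# and Euclidean pruning bounds based on it do not transfer.
```

Because the documents are unit-normalized, the assignment step only touches the non-zero terms of each vector, which is the sparsity the paper exploits further with its index.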
Related papers
- Accelerating k-Means Clustering with Cover Trees [0.30693357740321775]
We propose a new k-means algorithm based on the cover tree index that has relatively low overhead and performs well.
We obtain a hybrid algorithm that combines the benefits of tree aggregation and bounds-based filtering.
arXiv Detail & Related papers (2024-10-19T14:02:42Z)
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- A density peaks clustering algorithm with sparse search and K-d tree [16.141611031128427]
A density peaks clustering algorithm with sparse search and a K-d tree is developed to reduce the method's high computational complexity.
Experiments are carried out on datasets with different distribution characteristics, comparing the method with five other typical clustering algorithms.
arXiv Detail & Related papers (2022-03-02T09:29:40Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our contribution is substantially more efficient than k-means as it does not require an all-to-all comparison of data points and clusters.
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
- K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- An Efficient Smoothing Proximal Gradient Algorithm for Convex Clustering [2.5182813818441945]
The recently introduced convex clustering approach formulates clustering as a convex optimization problem.
State-of-the-art convex clustering algorithms require substantial computation and memory.
In this paper, we develop a very efficient smoothing gradient algorithm (Sproga) for convex clustering.
arXiv Detail & Related papers (2020-06-22T20:02:59Z)
- SimpleMKKM: Simple Multiple Kernel K-means [49.500663154085586]
We propose a simple yet effective multiple kernel clustering algorithm, termed simple multiple kernel k-means (SimpleMKKM).
Our criterion is given by an intractable minimization-maximization problem in the kernel coefficient and clustering partition matrix.
We theoretically analyze the performance of SimpleMKKM in terms of its clustering generalization error.
arXiv Detail & Related papers (2020-05-11T10:06:40Z)
- Ball k-means [53.89505717006118]
The Ball k-means algorithm uses a ball to describe a cluster, focusing on reducing the point-centroid distance computation.
The fast speed, no extra parameters and simple design of the Ball k-means make it an all-around replacement of the naive k-means algorithm.
arXiv Detail & Related papers (2020-05-02T10:39:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.