Relational Algorithms for k-means Clustering
- URL: http://arxiv.org/abs/2008.00358v2
- Date: Thu, 20 May 2021 22:18:08 GMT
- Title: Relational Algorithms for k-means Clustering
- Authors: Benjamin Moseley, Kirk Pruhs, Alireza Samadian, Yuyan Wang
- Abstract summary: This paper gives a k-means approximation algorithm that is efficient in the relational algorithms model.
The running time is potentially exponentially smaller than $N$, the number of data points to be clustered that the relational database represents.
- Score: 17.552485682328772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper gives a k-means approximation algorithm that is efficient in the
relational algorithms model. This is an algorithm that operates directly on a
relational database without performing a join to convert it to a matrix whose
rows represent the data points. The running time is potentially exponentially
smaller than $N$, the number of data points to be clustered that the relational
database represents.
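A toy example (not from the paper) makes this size gap concrete: the design matrix is the join of the input tables, so its row count can be the product of the table sizes, while the relational input itself stays small.

```python
import itertools

# Hypothetical two-table schema; each table contributes a block of features.
T1 = [(0.0,), (1.0,), (2.0,)]      # 3 rows, feature x1
T2 = [(0.0, 5.0), (1.0, 4.0)]      # 2 rows, features x2 and x3

# Materializing the join yields |T1| * |T2| = 6 data points; with m tables
# of n rows each this grows to n**m points, while the relational input
# itself has only m * n rows -- hence N can be exponentially larger.
design_matrix = [r1 + r2 for r1, r2 in itertools.product(T1, T2)]
print(len(design_matrix))   # 6
print(design_matrix[0])     # (0.0, 0.0, 5.0)
```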
Few relational algorithms are known, and this paper offers techniques for
designing relational algorithms as well as for characterizing their limitations.
We show that, given two data points as cluster centers, if we cluster points
according to their closest centers, it is NP-hard to approximate the number of
points in the clusters on a general relational input. This is trivial for
conventional data inputs, and this result exemplifies that standard algorithmic
techniques may not be directly applied when designing an efficient relational
algorithm. This paper then introduces a new method that leverages rejection
sampling and the $k$-means++ algorithm to construct an $O(1)$-approximate
$k$-means solution.
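For context, the sketch below shows standard k-means++ ($D^2$) seeding on an explicit point matrix. The paper's relational algorithm samples from this same distribution via rejection sampling without ever materializing the $N$ points; that machinery is not reproduced here, so treat this as background only.

```python
import numpy as np

def kmeanspp_seeding(X, k, seed=0):
    """Classical k-means++ seeding: each new center is drawn with
    probability proportional to squared distance to the nearest center."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                 # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)     # squared distances
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())       # D^2 sampling
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)

X = np.random.default_rng(1).normal(size=(500, 2))
print(kmeanspp_seeding(X, k=3).shape)              # (3, 2)
```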
Related papers
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
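The k-center half of that objective can be illustrated with the classical greedy farthest-first traversal (Gonzalez's 2-approximation); the weighted combination with uncertainty sampling that yields the paper's 3-approximation is omitted in this sketch.

```python
import numpy as np

def greedy_k_center(X, k, seed=0):
    """Farthest-first traversal: a classical 2-approximation for k-center.
    Only the plain k-center part of the subset-selection objective."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[centers[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))        # point farthest from chosen set
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return centers                        # indices of the selected subset
```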
- Linear time Evidence Accumulation Clustering with KMeans [0.0]
This work describes a trick that mimics the behavior of average-linkage clustering.
We found a way to compute the density of a partitioning efficiently, reducing the cost from quadratic to linear complexity.
The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low.
arXiv Detail & Related papers (2023-11-15T14:12:59Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- An enhanced method of initial cluster center selection for K-means algorithm [0.0]
We propose a novel approach to improve initial cluster selection for the K-means algorithm.
The Convex Hull algorithm facilitates the computing of the first two centroids and the remaining ones are selected according to the distance from previously selected centers.
We obtained clustering errors of only 7.33%, 7.90%, and 0% on the Iris, Letter, and Ruspini datasets, respectively.
arXiv Detail & Related papers (2022-10-18T00:58:50Z)
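A hedged reading of that initializer in code: pick the two mutually farthest convex-hull vertices as the first centroids, then repeatedly take the point farthest from the centers chosen so far (names and details here are illustrative, not the authors' exact implementation).

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_init(X, k):
    """Illustrative K-means initializer: first two centers from the convex
    hull, the rest by distance from previously selected centers."""
    hull = X[ConvexHull(X).vertices]              # hull vertices
    d = np.linalg.norm(hull[:, None] - hull[None, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centers = [hull[i], hull[j]]                  # two farthest hull points
    dist = np.minimum(np.linalg.norm(X - centers[0], axis=1),
                      np.linalg.norm(X - centers[1], axis=1))
    while len(centers) < k:
        centers.append(X[np.argmax(dist)])        # farthest remaining point
        dist = np.minimum(dist, np.linalg.norm(X - centers[-1], axis=1))
    return np.array(centers)
```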
- Careful seeding for the k-medoids algorithm with incremental k++ cluster construction [4.981260380070016]
An improved k-medoids algorithm (INCKM) was recently proposed to overcome a known drawback of the classical algorithm.
We propose a novel incremental k-medoids algorithm (INCKPP) which dynamically increases the number of clusters.
Our algorithm can overcome the parameter selection problem in the improved k-medoids algorithm, improve clustering performance, and deal with imbalanced datasets very well.
arXiv Detail & Related papers (2022-07-06T02:25:35Z)
- ck-means, a novel unsupervised learning method that combines fuzzy and crispy clustering methods to extract intersecting data [1.827510863075184]
This paper proposes a method to cluster data that share the same intersections between two features or more.
The main idea of this novel method is to generate fuzzy clusters of data using a Fuzzy C-Means (FCM) algorithm.
The algorithm is also able to find the optimal number of clusters for the FCM and the k-means algorithm, according to the consistency of the clusters given by the Silhouette Index (SI).
arXiv Detail & Related papers (2022-06-17T19:29:50Z)
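The FCM core that ck-means builds on is a short alternating update of soft memberships and centers; a minimal sketch (fuzzifier m = 2, without the intersection-extraction and Silhouette-based model-selection steps):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means: alternate membership and center updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)       # each row: soft memberships
    for _ in range(iters):
        W = U ** m                          # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=-1) + 1e-12
        U = d ** (-2.0 / (m - 1.0))         # standard FCM membership update
        U /= U.sum(axis=1, keepdims=True)
    return U, centers
```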
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPPs) for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPPs, uniform random sampling fails both to ensure diversity and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- Clustering of Big Data with Mixed Features [3.3504365823045044]
We develop a new clustering algorithm for large data of mixed type.
The algorithm is capable of detecting outliers and clusters of relatively low density.
We present experimental results to verify that our algorithm works well in practice.
arXiv Detail & Related papers (2020-11-11T19:54:38Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
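The paper's tight-ratio constructions are involved; as a simpler illustration of the basic mechanism behind private clustering, here is a SuLQ-style differentially private Lloyd step (Laplace noise on per-cluster sums and counts), a classical technique and not the paper's algorithm.

```python
import numpy as np

def dp_lloyd_step(X, centers, eps, seed=0):
    """One SuLQ-style DP Lloyd update: release noisy per-cluster sums and
    counts. Assumes coordinates scaled to [0, 1] to bound sensitivity."""
    rng = np.random.default_rng(seed)
    labels = np.argmin(
        np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
    dim = X.shape[1]
    new_centers = []
    for j in range(len(centers)):
        pts = X[labels == j]
        noisy_sum = pts.sum(axis=0) + rng.laplace(0.0, dim / eps, size=dim)
        noisy_cnt = max(1.0, len(pts) + rng.laplace(0.0, 1.0 / eps))
        new_centers.append(noisy_sum / noisy_cnt)
    return np.array(new_centers)
```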
- FedPD: A Federated Learning Framework with Optimal Rates and Adaptivity to Non-IID Data [59.50904660420082]
Federated Learning (FL) has become a popular paradigm for learning from distributed data.
To effectively utilize data at different devices without moving them to the cloud, algorithms such as Federated Averaging (FedAvg) have adopted a "computation then aggregation" (CTA) model.
arXiv Detail & Related papers (2020-05-22T23:07:42Z)
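The "computation then aggregation" pattern is easy to sketch: each client runs local gradient steps, then the server averages the resulting models, weighted by local data size (a minimal FedAvg skeleton; `local_grad` is an assumed user-supplied gradient function).

```python
import numpy as np

def fedavg_round(global_w, client_data, local_grad, lr=0.1, local_steps=5):
    """One FedAvg round in the 'computation then aggregation' (CTA) model:
    clients compute locally; the server only averages the results."""
    client_ws = []
    for data in client_data:                   # computation, on each device
        w = global_w.copy()
        for _ in range(local_steps):
            w -= lr * local_grad(w, data)      # local SGD steps
        client_ws.append(w)
    sizes = np.array([len(d) for d in client_data], dtype=float)
    weights = sizes / sizes.sum()              # weight by local sample count
    return sum(wi * ci for wi, ci in zip(weights, client_ws))  # aggregation
```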