Evolution of $K$-means solution landscapes with the addition of dataset
outliers and a robust clustering comparison measure for their analysis
- URL: http://arxiv.org/abs/2306.14346v1
- Date: Sun, 25 Jun 2023 21:22:21 GMT
- Authors: Luke Dicks and David J. Wales
- Abstract summary: We use the energy landscape approach to map the change in $K$-means solution space as a result of increasing dataset outliers.
Kinetic analysis reveals that in all cases the overall funnel is composed of shallow locally-funnelled regions.
We propose that the rates obtained from kinetic analysis provide a novel measure of clustering similarity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The $K$-means algorithm remains one of the most widely-used clustering
methods due to its simplicity and general utility. The performance of $K$-means
depends upon the location of minima low in the cost function, amongst a potentially
vast number of solutions. Here, we use the energy landscape approach to map the
change in $K$-means solution space as a result of increasing dataset outliers
and show that the cost function surface becomes more funnelled. Kinetic
analysis reveals that in all cases the overall funnel is composed of shallow
locally-funnelled regions, each of which is separated by areas that do not
support any clustering solutions. These shallow regions correspond to different
types of clustering solution and their increasing number with outliers leads to
longer pathways within the funnel and a reduced correlation between accuracy
and cost function. Finally, we propose that the rates obtained from kinetic
analysis provide a novel measure of clustering similarity that incorporates
information about the paths between them. This measure is robust to outliers
and we illustrate the application to datasets containing multiple outliers.
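The landscape picture in the abstract can be illustrated without any specialised machinery: running plain Lloyd's algorithm from many random initializations exposes distinct local minima of the $K$-means cost function, and appending a single distant outlier perturbs that set of minima. The numpy sketch below is illustrative only (synthetic blobs, invented helper names), not the authors' energy-landscape method.

```python
import numpy as np

def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def lloyd(X, k, rng, iters=100):
    """Plain Lloyd's algorithm from a random initialization."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

rng = np.random.default_rng(0)
# Three well-separated Gaussian blobs.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in [(0, 0), (5, 0), (0, 5)]])
# The same data with a single distant outlier appended.
X_out = np.vstack([X, [[20.0, 20.0]]])

for data, name in [(X, "clean"), (X_out, "with outlier")]:
    costs = {round(kmeans_cost(data, lloyd(data, 3, rng)), 2) for _ in range(20)}
    print(name, "distinct final costs:", sorted(costs))
```

Each distinct final cost corresponds to a different basin of the solution landscape; the outlier typically introduces extra shallow minima (e.g. solutions that spend a centroid on the outlier alone).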
Related papers
- One-shot Robust Federated Learning of Independent Component Analysis [16.462282750354408]
We propose a geometric median-based aggregation algorithm that leverages $k$-means clustering to resolve the permutation ambiguity in local client estimations. Our method first performs $k$-means to partition client-provided estimators into clusters and then aggregates estimators within each cluster using the geometric median.
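The geometric median used for aggregation is commonly computed with Weiszfeld's algorithm. The numpy sketch below shows only that aggregation step under our own assumptions (the paper's $k$-means permutation-resolution stage is omitted, and the client estimates are invented):

```python
import numpy as np

def geometric_median(P, iters=100, eps=1e-8):
    """Weiszfeld's algorithm: iteratively reweighted mean converging to the
    point minimizing the sum of Euclidean distances to the rows of P."""
    m = P.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(P - m, axis=1)
        d = np.maximum(d, eps)            # guard against division by zero
        w = 1.0 / d
        m_new = (w[:, None] * P).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            break
        m = m_new
    return m

# Hypothetical client estimates of the same quantity: four agree, one is corrupted.
estimates = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 0.95],
                      [50.0, -40.0]])
print(geometric_median(estimates))   # stays near (1, 1), unlike the plain mean
```

The robustness motivating the choice is visible here: a single corrupted client drags the coordinate-wise mean far away, while the geometric median remains near the honest estimates.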
arXiv Detail & Related papers (2025-05-26T21:37:19Z)
- Strong bounds for large-scale Minimum Sum-of-Squares Clustering [0.9831489366502302]
Minimum Sum-of-Squares Clustering (MSSC) is one of the most widely used clustering methods.
MSSC aims to minimize the total squared Euclidean distance between data points and their corresponding cluster centroids.
We introduce a novel method to validate MSSC solutions through optimality gaps.
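The MSSC objective stated above is straightforward to write down explicitly; a minimal numpy sketch (function name and toy data are ours, not from the paper):

```python
import numpy as np

def mssc_cost(X, labels, centroids):
    """Total squared Euclidean distance from each point to its assigned centroid."""
    return sum(((X[labels == j] - centroids[j]) ** 2).sum()
               for j in range(len(centroids)))

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
# Centroids are the cluster means, as MSSC requires.
centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
print(mssc_cost(X, labels, centroids))  # 4.0: each point is distance 1 from its centroid
```

An optimality gap, as used for validation in the paper, compares a heuristic's cost against a lower bound on this objective; the sketch only evaluates the objective itself.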
arXiv Detail & Related papers (2025-02-12T13:40:00Z)
- Boosting K-means for Big Data by Fusing Data Streaming with Global Optimization [0.3069335774032178]
K-means clustering is a cornerstone of data mining, but its efficiency deteriorates when confronted with massive datasets.
We propose a novel algorithm that leverages the Variable Neighborhood Search (VNS) metaheuristic to optimize K-means clustering for big data.
arXiv Detail & Related papers (2024-10-18T15:43:34Z)
- Fair Clustering for Data Summarization: Improved Approximation Algorithms and Complexity Insights [16.120911591795295]
In some applications all data points can be chosen as centers; in the general setting, centers must be chosen from a subset of points, referred to as facilities or suppliers.
In this work, we focus on fair data summarization modeled as the fair $k$-supplier problem, where data consists of several groups, and a minimum number of centers must be selected from each group.
arXiv Detail & Related papers (2024-10-16T18:00:19Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Research on Efficient Fuzzy Clustering Method Based on Local Fuzzy Granular balls [67.33923111887933]
In this paper, the data is iteratively fuzzified using granular-balls, and the membership degree of each data point only considers the two granular-balls where it is located.
The formed fuzzy granular-balls set can use more processing methods in the face of different data scenarios.
arXiv Detail & Related papers (2023-03-07T01:52:55Z)
- A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data [0.0]
We introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model.
The inferential and predictive efficacy of this approach is further illustrated through its application in student scoring and protein expression.
arXiv Detail & Related papers (2023-02-06T23:34:51Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that provide utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that by making few queries, the problem becomes easier to solve.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- ThetA -- fast and robust clustering via a distance parameter [3.0020405188885815]
Clustering is a fundamental problem in machine learning where distance-based approaches have dominated the field for many decades.
We propose a new set of distance threshold methods called Theta-based Algorithms (ThetA)
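The ThetA family is described only at a high level above. As a rough illustration of what a distance-threshold method looks like, here is a generic leader-style sketch in numpy; this is a stand-in under our own assumptions, not the authors' exact algorithms:

```python
import numpy as np

def theta_cluster(X, theta):
    """Sequential distance-threshold clustering: a point joins the first
    existing cluster whose representative lies within theta of it,
    otherwise it founds a new cluster with itself as representative."""
    reps, labels = [], []
    for x in X:
        for j, r in enumerate(reps):
            if np.linalg.norm(x - r) <= theta:
                labels.append(j)
                break
        else:                       # no representative within theta
            reps.append(x)
            labels.append(len(reps) - 1)
    return np.array(labels)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
print(theta_cluster(X, theta=1.0))  # [0 0 1 1]
```

Note the single distance parameter theta replaces the choice of $k$: the number of clusters emerges from the data, which is the appeal of threshold-based methods.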
arXiv Detail & Related papers (2021-02-13T23:16:33Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\|_\infty \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.