On uniqueness of the set of k-means
- URL: http://arxiv.org/abs/2410.13495v1
- Date: Thu, 17 Oct 2024 12:40:56 GMT
- Title: On uniqueness of the set of k-means
- Authors: Javier Cárcamo, Antonio Cuevas, Luis A. Rodríguez,
- Abstract summary: We give an assessment on consistency of the empirical k-means adapted to the setting of non-uniqueness.
We derive a bootstrap test for uniqueness of the set of k-means.
The results are illustrated with examples of different types of non-uniqueness.
- Score: 0.5735035463793009
- License:
- Abstract: We provide necessary and sufficient conditions for the uniqueness of the k-means set of a probability distribution. This uniqueness problem is related to the choice of k: depending on the underlying distribution, some values of this parameter could lead to multiple sets of k-means, which hampers the interpretation of the results and/or the stability of the algorithms. We give a general assessment on consistency of the empirical k-means adapted to the setting of non-uniqueness and determine the asymptotic distribution of the within cluster sum of squares (WCSS). We also provide statistical characterizations of k-means uniqueness in terms of the asymptotic behavior of the empirical WCSS. As a consequence, we derive a bootstrap test for uniqueness of the set of k-means. The results are illustrated with examples of different types of non-uniqueness and we check by simulations the performance of the proposed methodology.
Related papers
- Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - Bayesian Quantification with Black-Box Estimators [1.599072005190786]
Approaches like adjusted classify and count, black-box shift estimators, and invariant ratio estimators use an auxiliary (and potentially biased) black-box classifier to estimate the class distribution and yield guarantees under weak assumptions.
We demonstrate that all these algorithms are closely related to the inference in a particular Bayesian Chain model, approxing the assumed ground-truthgenerative process.
Then, we discuss an efficient Markov Monte Carlo sampling scheme for the introduced model and show an consistency guarantee in the large-data limit.
arXiv Detail & Related papers (2023-02-17T22:10:04Z) - A One-shot Framework for Distributed Clustered Learning in Heterogeneous
Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
One-shot approach, based on local computations at the users and a clustering based aggregation step at the server is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z) - Selective inference for k-means clustering [0.0]
We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
arXiv Detail & Related papers (2022-03-29T06:28:12Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - A Nonparametric Test of Dependence Based on Ensemble of Decision Trees [0.0]
The proposed coefficient is a permutation-like statistic that quantifies how much the observed sample S_n : (X_i, Y_i), i = 1.
n is discriminable from the permutated sample S_nn : (X_i, Y_j), i, j = 1.
n, where the two variables are independent.
arXiv Detail & Related papers (2020-07-24T02:48:33Z) - Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (instability) when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z) - Robust M-Estimation Based Bayesian Cluster Enumeration for Real
Elliptically Symmetric Distributions [5.137336092866906]
Robustly determining optimal number of clusters in a data set is an essential factor in a wide range of applications.
This article generalizes so that it can be used with any arbitrary Really Symmetric (RES) distributed mixture model.
We derive a robust criterion for data sets with finite sample size, and also provide an approximation to reduce the computational cost at large sample sizes.
arXiv Detail & Related papers (2020-05-04T11:44:49Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - Profile Entropy: A Fundamental Measure for the Learnability and
Compressibility of Discrete Distributions [63.60499266361255]
We show that for samples of discrete distributions, profile entropy is a fundamental measure unifying the concepts of estimation, inference, and compression.
Specifically, profile entropy a) determines the speed of estimating the distribution relative to the best natural estimator; b) characterizes the rate of inferring all symmetric properties compared with the best estimator over any label-invariant distribution collection; c) serves as the limit of profile compression.
arXiv Detail & Related papers (2020-02-26T17:49:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.