Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data
- URL: http://arxiv.org/abs/2209.15097v2
- Date: Mon, 29 May 2023 00:15:59 GMT
- Title: Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data
- Authors: Yubo Zhuang, Xiaohui Chen, Yun Yang
- Abstract summary: Clustering is a widely deployed unsupervised learning tool.
iLA-SDP is less sensitive than EM to initialization and more stable on high-dimensional data.
- Score: 16.153709556346417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is a widely deployed unsupervised learning tool. Model-based
clustering is a flexible framework to tackle data heterogeneity when the
clusters have different shapes. Likelihood-based inference for mixture
distributions often involves non-convex and high-dimensional objective
functions, imposing difficult computational and statistical challenges. The
classic expectation-maximization (EM) algorithm is a computationally thrifty
iterative method that maximizes a surrogate function minorizing the
log-likelihood of the observed data in each iteration; however, it suffers from
bad local maxima even in the special case of the standard Gaussian mixture
model with common isotropic covariance matrices. On the other hand, recent
studies reveal that the unique global solution of a semidefinite programming
(SDP) relaxed $K$-means achieves the information-theoretically sharp threshold
for perfectly recovering the cluster labels under the standard Gaussian mixture
model. In this paper, we extend the SDP approach to a general setting by
integrating cluster labels as model parameters and propose an iterative
likelihood adjusted SDP (iLA-SDP) method that directly maximizes the exact
observed likelihood in the presence of data heterogeneity. By lifting the
cluster assignment to group-specific membership matrices, iLA-SDP avoids
centroid estimation -- a key feature that allows exact recovery under
well-separated centroids without being trapped by their adversarial
configurations. Thus iLA-SDP is less sensitive than EM to initialization and
more stable on high-dimensional data. Our numerical experiments demonstrate that
iLA-SDP achieves lower mis-clustering errors than several widely used
clustering methods, including $K$-means, SDP, and EM.
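
For context on the surrogate mentioned above: the bound EM maximizes is the standard Jensen minorization of the observed-data log-likelihood (textbook material, not specific to this paper). For a $K$-component mixture $p_\theta(x) = \sum_{k=1}^K \pi_k f(x;\theta_k)$ and current iterate $\theta^{(t)}$,
$$
\ell(\theta) \;=\; \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i;\theta_k)
\;\ge\; \sum_{i=1}^n \sum_{k=1}^K \gamma_{ik}^{(t)} \log \frac{\pi_k f(x_i;\theta_k)}{\gamma_{ik}^{(t)}}
\;=:\; Q(\theta \mid \theta^{(t)}),
\qquad
\gamma_{ik}^{(t)} = \frac{\pi_k^{(t)} f(x_i;\theta_k^{(t)})}{\sum_{j=1}^K \pi_j^{(t)} f(x_i;\theta_j^{(t)})},
$$
with equality at $\theta = \theta^{(t)}$. The M-step maximizes $Q$, so each iteration cannot decrease $\ell$, but the ascent can terminate at a bad local maximum, which is the failure mode the abstract describes.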
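The SDP-relaxed $K$-means referenced above admits a compact convex formulation; a minimal sketch of one standard form (the Peng-Wei relaxation) is below. It is written in Python with cvxpy as an assumed solver stack, and it implements plain relaxed $K$-means rather than the paper's likelihood-adjusted iteration; the function and variable names are illustrative. It shows how lifting cluster assignments to a membership matrix removes centroids from the problem.

```python
# Sketch of the Peng-Wei SDP relaxation of K-means (assumption: this is the
# "SDP relaxed K-means" the abstract refers to; it is NOT the paper's iLA-SDP).
import numpy as np
import cvxpy as cp

def sdp_kmeans(X, K):
    """Solve max <XX^T, Z> over the lifted membership polytope."""
    n = X.shape[0]
    A = X @ X.T                        # Gram matrix of the n data points
    Z = cp.Variable((n, n), PSD=True)  # lifted membership matrix, Z PSD
    constraints = [
        Z >= 0,                        # entrywise nonnegativity
        cp.sum(Z, axis=1) == 1,        # Z 1 = 1: each row sums to one
        cp.trace(Z) == K,              # trace fixes the number of clusters
    ]
    cp.Problem(cp.Maximize(cp.trace(A @ Z)), constraints).solve()
    return Z.value                     # near block-diagonal under good separation

# Example usage (hypothetical data): two well-separated Gaussian blobs.
# X = np.vstack([np.random.randn(20, 2) - 4, np.random.randn(20, 2) + 4])
# Z_star = sdp_kmeans(X, K=2)
# Hard labels can then be rounded from Z_star, e.g. by spectral clustering
# on its top-K eigenvectors.
```

For a hard assignment into groups $G_1,\dots,G_K$, the exact membership matrix is $Z = \sum_k |G_k|^{-1} 1_{G_k} 1_{G_k}^\top$, and maximizing $\langle XX^\top, Z\rangle$ is equivalent to minimizing the $K$-means loss. Per the abstract, iLA-SDP replaces this objective with a likelihood-adjusted one and iterates to handle heterogeneous cluster shapes; its precise objective is not reproduced here.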
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The one-step K-means dimensionality-reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Adaptive Fuzzy C-Means with Graph Embedding [84.47075244116782]
Fuzzy clustering algorithms can be roughly categorized into two main groups: Fuzzy C-Means (FCM) based methods and mixture model based methods.
We propose a novel FCM-based clustering model that is capable of automatically learning an appropriate membership-degree hyperparameter value.
arXiv Detail & Related papers (2024-05-22T08:15:50Z)
- A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data.
Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
- Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- A Generalized Framework for Predictive Clustering and Optimization [18.06697544912383]
Clustering is a powerful and extensively used data science tool.
In this article, we define a generalized optimization framework for predictive clustering.
We also present a joint optimization strategy that exploits mixed-integer linear programming (MILP) for global optimization.
arXiv Detail & Related papers (2023-05-07T19:56:51Z)
- A distribution-free mixed-integer optimization approach to hierarchical modelling of clustered and longitudinal data [0.0]
We introduce an innovative algorithm that evaluates cluster effects for new data points, thereby increasing the robustness and precision of this model.
The inferential and predictive efficacy of this approach is further illustrated through its application in student scoring and protein expression.
arXiv Detail & Related papers (2023-02-06T23:34:51Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Personalized Federated Learning via Convex Clustering [72.15857783681658]
We propose a family of algorithms for personalized federated learning with locally convex user costs.
The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized.
arXiv Detail & Related papers (2022-02-01T19:25:31Z)
- Kernel learning approaches for summarising and combining posterior similarity matrices [68.8204255655161]
We build upon the notion of the posterior similarity matrix (PSM) in order to suggest new approaches for summarising the output of MCMC algorithms for Bayesian clustering models.
A key contribution of our work is the observation that PSMs are positive semi-definite, and hence can be used to define probabilistically-motivated kernel matrices.
arXiv Detail & Related papers (2020-09-27T14:16:14Z)
- Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions [5.137336092866906]
Robustly determining the optimal number of clusters in a data set is essential in a wide range of applications.
This article generalizes robust Bayesian cluster enumeration so that it can be used with any arbitrary Real Elliptically Symmetric (RES) distributed mixture model.
We derive a robust criterion for data sets with finite sample size, and also provide an approximation to reduce the computational cost at large sample sizes.
arXiv Detail & Related papers (2020-05-04T11:44:49Z)
- Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions [6.839746711757701]
Traditional approaches to model-based clustering often fail for high dimensional data.
A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed.
An analytically feasible expectation-maximization algorithm is developed.
arXiv Detail & Related papers (2019-03-12T17:02:40Z)