Related papers: Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions

URL: http://arxiv.org/abs/1903.05054v2
Date: Thu, 6 Jun 2024 12:30:18 GMT
Title: Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions
Authors: Alexa A. Sochaniwsky, Michael P. B. Gallaugher, Yang Tang, Paul D. McNicholas,
Abstract summary: Traditional approaches to model-based clustering often fail for high dimensional data. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. An analytically feasible expectation-maximization algorithm is developed.
Score: 6.839746711757701
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Robust clustering of high-dimensional data is an important topic because clusters in real datasets are often heavy-tailed and/or asymmetric. Traditional approaches to model-based clustering often fail for high dimensional data, e.g., due to the number of free covariance parameters. A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed. This parameterization includes a penalty term in the likelihood. An analytically feasible expectation-maximization algorithm is developed by placing a gamma-lasso penalty constraining the concentration matrix. The proposed methodology is investigated through simulation studies and illustrated using two real datasets.

Related papers

A Hybrid Mixture Approach for Clustering and Characterizing Cancer Data [0.07673339435080444]
Modern biomedical data coming with high dimensions make it challenging to perform the model estimation in traditional cluster analysis.<n>We propose a hybrid matrix-free computational scheme to efficiently estimate the clusters and model parameters.<n>Our algorithms are applied to accurately identify and distinguish types of breast cancer based on large tumor samples, and to provide a generalized characterization for subtypes of lymphoma using massive gene records.
arXiv Detail & Related papers (2025-07-18T22:01:03Z)
A Hybrid Mixture of $t$-Factor Analyzers for Clustering High-dimensional Data [0.07673339435080444]
This paper develops a novel hybrid approach for estimating the mixture model of $t$-factor analyzers (MtFA) The effectiveness of our approach is demonstrated through simulations showcasing its superior computational efficiency compared to the existing method. Our method is applied to cluster the Gamma-ray bursts, reinforcing several claims in the literature that Gamma-ray bursts have heterogeneous subpopulations and providing characterizations of the estimated groups.
arXiv Detail & Related papers (2025-04-29T18:59:58Z)
Wrapped Gaussian on the manifold of Symmetric Positive Definite Matrices [6.7523635840772505]
Circular and non-flat data distributions are prevalent across diverse domains of data science. A principled approach to accounting for the underlying geometry of such data is pivotal. This work lays the groundwork for extending classical machine learning and statistical methods to more complex and structured data.
arXiv Detail & Related papers (2025-02-03T16:46:46Z)
Adaptive Fuzzy C-Means with Graph Embedding [84.47075244116782]
Fuzzy clustering algorithms can be roughly categorized into two main groups: Fuzzy C-Means (FCM) based methods and mixture model based methods. We propose a novel FCM based clustering model that is capable of automatically learning an appropriate membership degree hyper- parameter value.
arXiv Detail & Related papers (2024-05-22T08:15:50Z)
Diffusion posterior sampling for simulation-based inference in tall data settings [53.17563688225137]
Simulation-based inference ( SBI) is capable of approximating the posterior distribution that relates input parameters to a given observation. In this work, we consider a tall data extension in which multiple observations are available to better infer the parameters of the model. We compare our method to recently proposed competing approaches on various numerical experiments and demonstrate its superiority in terms of numerical stability and computational cost.
arXiv Detail & Related papers (2024-04-11T09:23:36Z)
Distributional Reduction: Unifying Dimensionality Reduction and Clustering with Gromov-Wasserstein [56.62376364594194]
Unsupervised learning aims to capture the underlying structure of potentially large and high-dimensional datasets. In this work, we revisit these approaches under the lens of optimal transport and exhibit relationships with the Gromov-Wasserstein problem. This unveils a new general framework, called distributional reduction, that recovers DR and clustering as special cases and allows addressing them jointly within a single optimization problem.
arXiv Detail & Related papers (2024-02-03T19:00:19Z)
A provable initialization and robust clustering method for general mixture models [6.806940901668607]
Clustering is a fundamental tool in statistical machine learning in the presence of heterogeneous data. Most recent results focus on optimal mislabeling guarantees when data are distributed around centroids with sub-Gaussian errors.
arXiv Detail & Related papers (2024-01-10T22:56:44Z)
Clustering based on Mixtures of Sparse Gaussian Processes [6.939768185086753]
How to cluster data using their low dimensional embedded space is still a challenging problem in machine learning. In this article, we focus on proposing a joint formulation for both clustering and dimensionality reduction. Our algorithm is based on a mixture of sparse Gaussian processes, which is called Sparse Gaussian Process Mixture Clustering (SGP-MIC)
arXiv Detail & Related papers (2023-03-23T20:44:36Z)
Learning Graphical Factor Models with Riemannian Optimization [70.13748170371889]
This paper proposes a flexible algorithmic framework for graph learning under low-rank structural constraints. The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution. We leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models.
arXiv Detail & Related papers (2022-10-21T13:19:45Z)
Likelihood Adjusted Semidefinite Programs for Clustering Heterogeneous Data [16.153709556346417]
Clustering is a widely deployed learning tool. iLA-SDP is less sensitive than EM to and more stable on high-dimensional data.
arXiv Detail & Related papers (2022-09-29T21:03:13Z)
A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data. A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data. Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z)
Spatially Coherent Clustering Based on Orthogonal Nonnegative Matrix Factorization [0.0]
We introduce in this work clustering models based on a total variation (TV) regularization procedure on the cluster membership matrix. We provide a numerical evaluation of all proposed methods on a hyperspectral dataset obtained from a matrix-assisted laser desorption/ionisation imaging measurement.
arXiv Detail & Related papers (2021-04-25T23:40:41Z)
Robust M-Estimation Based Bayesian Cluster Enumeration for Real Elliptically Symmetric Distributions [5.137336092866906]
Robustly determining optimal number of clusters in a data set is an essential factor in a wide range of applications. This article generalizes so that it can be used with any arbitrary Really Symmetric (RES) distributed mixture model. We derive a robust criterion for data sets with finite sample size, and also provide an approximation to reduce the computational cost at large sample sizes.
arXiv Detail & Related papers (2020-05-04T11:44:49Z)
Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets. We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator. We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.