A Hybrid Mixture Approach for Clustering and Characterizing Cancer Data
- URL: http://arxiv.org/abs/2507.14380v1
- Date: Fri, 18 Jul 2025 22:01:03 GMT
- Title: A Hybrid Mixture Approach for Clustering and Characterizing Cancer Data
- Authors: Kazeem Kareem, Fan Dai,
- Abstract summary: Modern biomedical data coming with high dimensions make it challenging to perform the model estimation in traditional cluster analysis.<n>We propose a hybrid matrix-free computational scheme to efficiently estimate the clusters and model parameters.<n>Our algorithms are applied to accurately identify and distinguish types of breast cancer based on large tumor samples, and to provide a generalized characterization for subtypes of lymphoma using massive gene records.
- Score: 0.07673339435080444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based clustering is widely used for identifying and distinguishing types of diseases. However, modern biomedical data coming with high dimensions make it challenging to perform the model estimation in traditional cluster analysis. The incorporation of factor analyzer into the mixture model provides a way to characterize the large set of data features, but the current estimation method is computationally impractical for massive data due to the intrinsic slow convergence of the embedded algorithms, and the incapability to vary the size of the factor analyzers, preventing the implementation of a generalized mixture of factor analyzers and further characterization of the data clusters. We propose a hybrid matrix-free computational scheme to efficiently estimate the clusters and model parameters based on a Gaussian mixture along with generalized factor analyzers to summarize the large number of variables using a small set of underlying factors. Our approach outperforms the existing method with faster convergence while maintaining high clustering accuracy. Our algorithms are applied to accurately identify and distinguish types of breast cancer based on large tumor samples, and to provide a generalized characterization for subtypes of lymphoma using massive gene records.
Related papers
- A Hybrid Mixture of $t$-Factor Analyzers for Clustering High-dimensional Data [0.07673339435080444]
This paper develops a novel hybrid approach for estimating the mixture model of $t$-factor analyzers (MtFA)<n>The effectiveness of our approach is demonstrated through simulations showcasing its superior computational efficiency compared to the existing method.<n>Our method is applied to cluster the Gamma-ray bursts, reinforcing several claims in the literature that Gamma-ray bursts have heterogeneous subpopulations and providing characterizations of the estimated groups.
arXiv Detail & Related papers (2025-04-29T18:59:58Z) - Simple and Scalable Algorithms for Cluster-Aware Precision Medicine [0.0]
We propose a simple and scalable approach to joint clustering and embedding.
This novel, cluster-aware embedding approach overcomes the complexity and limitations of current joint embedding and clustering methods.
Our approach does not require the user to choose the desired number of clusters, but instead yields interpretable dendrograms of hierarchically clustered embeddings.
arXiv Detail & Related papers (2022-11-29T19:27:26Z) - RandomSCM: interpretable ensembles of sparse classifiers tailored for
omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and patterns discovery in high dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z) - Generalization Metrics for Practical Quantum Advantage in Generative
Models [68.8204255655161]
Generative modeling is a widely accepted natural use case for quantum computers.
We construct a simple and unambiguous approach to probe practical quantum advantage for generative modeling by measuring the algorithm's generalization performance.
Our simulation results show that our quantum-inspired models have up to a $68 times$ enhancement in generating unseen unique and valid samples.
arXiv Detail & Related papers (2022-01-21T16:35:35Z) - Tk-merge: Computationally Efficient Robust Clustering Under General
Assumptions [0.0]
We present a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration.
We also present natural generalizations of the approach as well as an adaptive procedure to estimate the amount of contamination in a data-driven fashion.
arXiv Detail & Related papers (2022-01-17T13:05:05Z) - Learning Gaussian Mixtures with Generalised Linear Models: Precise
Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exacts characterising the estimator in high-dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z) - Generalized Matrix Factorization: efficient algorithms for fitting
generalized linear latent variable models to large data arrays [62.997667081978825]
Generalized Linear Latent Variable models (GLLVMs) generalize such factor models to non-Gaussian responses.
Current algorithms for estimating model parameters in GLLVMs require intensive computation and do not scale to large datasets.
We propose a new approach for fitting GLLVMs to high-dimensional datasets, based on approximating the model using penalized quasi-likelihood.
arXiv Detail & Related papers (2020-10-06T04:28:19Z) - A Systematic Approach to Featurization for Cancer Drug Sensitivity
Predictions with Deep Learning [49.86828302591469]
We train >35,000 neural network models, sweeping over common featurization techniques.
We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features.
arXiv Detail & Related papers (2020-04-30T20:42:17Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - Clustering Binary Data by Application of Combinatorial Optimization
Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhoods and population behavior optimization metaheuristics.
From a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregations using L1 dissimilarity, with hierarchical clustering, and a version of k-means: partitioning around medoids or PAM.
arXiv Detail & Related papers (2020-01-06T23:33:31Z) - Flexible Clustering with a Sparse Mixture of Generalized Hyperbolic Distributions [6.839746711757701]
Traditional approaches to model-based clustering often fail for high dimensional data.
A parametrization of the component scale matrices for the mixture of generalized hyperbolic distributions is proposed.
An analytically feasible expectation-maximization algorithm is developed.
arXiv Detail & Related papers (2019-03-12T17:02:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.