Impact of Data Patterns on Biotype identification Using Machine Learning
- URL: http://arxiv.org/abs/2503.12066v1
- Date: Sat, 15 Mar 2025 09:44:00 GMT
- Title: Impact of Data Patterns on Biotype identification Using Machine Learning
- Authors: Yuetong Yu, Ruiyang Ge, Ilker Hacihaliloglu, Alexander Rauscher, Roger Tam, Sophia Frangou
- Abstract summary: This study investigates the contribution of data patterns to algorithm performance by leveraging synthetic brain morphometry data as an exemplar. SuStaIn failed to process datasets with more than 17 variables, highlighting computational inefficiencies. SmileGAN and SurrealGAN outperformed other algorithms in identifying variable-based disease patterns, but these patterns were not able to provide individual-level classification.
- Score: 38.321248253111776
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Background: Patient stratification in brain disorders remains a significant challenge, despite advances in machine learning and multimodal neuroimaging. Automated machine learning algorithms have been widely applied for identifying patient subtypes (biotypes), but results have been inconsistent across studies. These inconsistencies are often attributed to algorithmic limitations, yet an overlooked factor may be the statistical properties of the input data. This study investigates the contribution of data patterns to algorithm performance by leveraging synthetic brain morphometry data as an exemplar. Methods: Four widely used algorithms (SuStaIn, HYDRA, SmileGAN, and SurrealGAN) were evaluated using multiple synthetic pseudo-patient datasets designed to include varying numbers and sizes of clusters and degrees of complexity of morphometric changes. Ground truth, representing predefined clusters, allowed for the evaluation of performance accuracy across algorithms and datasets. Results: SuStaIn failed to process datasets with more than 17 variables, highlighting computational inefficiencies. HYDRA was able to perform individual-level classification in multiple datasets with no clear pattern explaining failures. SmileGAN and SurrealGAN outperformed other algorithms in identifying variable-based disease patterns, but these patterns were not able to provide individual-level classification. Conclusions: Dataset characteristics significantly influence algorithm performance, often more than algorithmic design. The findings emphasize the need for rigorous validation using synthetic data before real-world application and highlight the limitations of current clustering approaches in capturing the heterogeneity of brain disorders. These insights extend beyond neuroimaging and have implications for machine learning applications in biomedical research.
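The evaluation idea in the abstract (synthetic data with predefined clusters, scored against ground truth) can be illustrated with a minimal sketch. This is not the authors' pipeline: plain k-means stands in for the four algorithms named above, the Rand index is an assumed accuracy measure, and every function name, cluster center, and spread value below is hypothetical.

```python
import random
from itertools import combinations


def make_synthetic(n_per_cluster, centers, spread, seed=0):
    """Draw points around predefined cluster centers; return (points, labels)."""
    rng = random.Random(seed)
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(n_per_cluster):
            points.append([rng.gauss(c, spread) for c in center])
            labels.append(label)
    return points, labels


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means, used here only as a stand-in clustering method."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        assign = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def rand_index(truth, pred):
    """Fraction of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical setup: two clusters in three variables, with low and high overlap.
centers = [[0.0, 0.0, 0.0], [4.0, 4.0, 4.0]]
for spread in (0.5, 3.0):
    points, truth = make_synthetic(40, centers, spread)
    pred = kmeans(points, k=2)
    print(f"spread={spread}: Rand index = {rand_index(truth, pred):.2f}")
```

Varying `spread`, the number of centers, or the number of variables changes the data pattern while the algorithm stays fixed, which is the kind of comparison the abstract describes.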
Related papers
- Learning to refine domain knowledge for biological network inference [2.209921757303168]
Perturbation experiments allow biologists to discover causal relationships between variables of interest.
The sparsity and high dimensionality of these data pose significant challenges for causal structure learning algorithms.
We propose an amortized algorithm for refining domain knowledge, based on data observations.
arXiv Detail & Related papers (2024-10-18T12:53:23Z)
- Artificial Data Point Generation in Clustered Latent Space for Small Medical Datasets [4.542616945567623]
This paper introduces a novel method, Artificial Data Point Generation in Clustered Latent Space (AGCL).
AGCL is designed to enhance classification performance on small medical datasets through synthetic data generation.
It was applied to Parkinson's disease screening, utilizing facial expression data.
arXiv Detail & Related papers (2024-09-26T09:51:08Z)
- An Autoencoder and Generative Adversarial Networks Approach for Multi-Omics Data Imbalanced Class Handling and Classification [2.2940141855172036]
In molecular biology, there has been an explosion of data generated from multi-omics sequencing.
Traditional statistical methods face challenging tasks when dealing with such high dimensional data.
This study focuses on tackling these challenges with a neural network that incorporates an autoencoder to extract a latent space of the features.
arXiv Detail & Related papers (2024-05-16T01:45:55Z)
- Amplifying Pathological Detection in EEG Signaling Pathways through Cross-Dataset Transfer Learning [10.212217551908525]
We study the effectiveness of data and model scaling and cross-dataset knowledge transfer in a real-world pathology classification task.
We identify the challenges of possible negative transfer and emphasize the significance of some key components.
Our findings indicate that a small and generic model (e.g. ShallowNet) performs well on a single dataset; however, a larger model (e.g. TCN) performs better on transfer and learning from a larger and diverse dataset.
arXiv Detail & Related papers (2023-09-19T20:09:15Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Deep learning for missing data imputation has obstacles despite these encouraging results, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Convolutional generative adversarial imputation networks for spatio-temporal missing data in storm surge simulations [86.5302150777089]
Generative Adversarial Nets (GANs) and GAN-based techniques have attracted attention as unsupervised machine learning methods.
We name our proposed method Convolutional Generative Adversarial Imputation Nets (Conv-GAIN).
arXiv Detail & Related papers (2021-11-03T03:50:48Z)
- Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data [2.0236506875465863]
Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits.
The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes.
We use well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in the binary relation.
arXiv Detail & Related papers (2020-10-22T12:27:43Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)