Multi-View Variational Autoencoder for Missing Value Imputation in
Untargeted Metabolomics
- URL: http://arxiv.org/abs/2310.07990v2
- Date: Tue, 12 Mar 2024 15:34:13 GMT
- Title: Multi-View Variational Autoencoder for Missing Value Imputation in
Untargeted Metabolomics
- Authors: Chen Zhao, Kuan-Jui Su, Chong Wu, Xuewei Cao, Qiuying Sha, Wu Li, Zhe
Luo, Tian Qin, Chuan Qiu, Lan Juan Zhao, Anqi Liu, Lindong Jiang, Xiao Zhang,
Hui Shen, Weihua Zhou, Hong-Wen Deng
- Abstract summary: We propose a novel method that leverages the information from WGS data and reference metabolites to impute unknown metabolites.
By learning the latent representations of both omics data, our method can effectively impute missing metabolomics values.
- Score: 17.563099908890013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Missing data is a common challenge in mass spectrometry-based
metabolomics, which can lead to biased and incomplete analyses. The integration
of whole-genome sequencing (WGS) data with metabolomics data has emerged as a
promising approach to enhance the accuracy of data imputation in metabolomics
studies. Method: In this study, we propose a novel method that leverages the
information from WGS data and reference metabolites to impute unknown
metabolites. Our approach utilizes a multi-view variational autoencoder to
jointly model burden scores, polygenic risk scores (PGS), and linkage
disequilibrium (LD) pruned single nucleotide polymorphisms (SNPs) for feature
extraction and missing metabolomics data imputation. By learning the latent
representations of both omics data, our method can effectively impute missing
metabolomics values based on genomic information. Results: We evaluate the
performance of our method on empirical metabolomics datasets with missing
values and demonstrate its superiority compared to conventional imputation
techniques. Using 35 template-metabolite-derived burden scores, PGS, and
LD-pruned SNPs, the proposed method achieved R^2 scores > 0.01 for 71.55% of
metabolites. Conclusion: The integration of WGS data in metabolomics imputation
not only improves data completeness but also enhances downstream analyses,
paving the way for more comprehensive and accurate investigations of metabolic
pathways and disease associations. Our findings offer valuable insights into
the potential benefits of utilizing WGS data for metabolomics data imputation
and underscore the importance of leveraging multi-modal data integration in
precision medicine research.
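The reported metric (the share of metabolites whose R^2 exceeds 0.01) can be illustrated with a minimal sketch. It assumes that R^2 is the standard coefficient of determination, computed per metabolite between observed and imputed values across samples. The abstract does not spell out the evaluation protocol, so this is an assumption. The function names `r2_score` and `fraction_above` and the toy values are hypothetical and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for one metabolite (assumed metric)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        # R^2 is undefined when the observed values are constant.
        return float("nan")
    return 1.0 - ss_res / ss_tot


def fraction_above(observed, imputed, threshold=0.01):
    """Return (share of metabolites with R^2 > threshold, per-metabolite R^2).

    observed, imputed: dicts mapping a metabolite name to a list of values,
    one value per sample, in the same sample order.
    """
    scores = {name: r2_score(observed[name], imputed[name]) for name in observed}
    # NaN compares False against the threshold, so undefined scores do not pass.
    passing = [name for name, score in scores.items() if score > threshold]
    return len(passing) / len(scores), scores


# Toy data invented for illustration only (two metabolites, four samples).
observed = {"m1": [1.0, 2.0, 3.0, 4.0], "m2": [2.0, 2.5, 1.5, 3.0]}
imputed = {"m1": [1.1, 1.9, 3.2, 3.8], "m2": [2.25, 2.25, 2.25, 2.25]}

fraction, scores = fraction_above(observed, imputed)
print(f"{fraction:.2%} of metabolites have R^2 > 0.01")
```

In the toy data, `m2` is imputed with its own mean, which gives an R^2 of zero. This shows why a low threshold such as 0.01 separates imputations that carry some signal from mean-like imputations that carry none.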
Related papers
- Gene-Metabolite Association Prediction with Interactive Knowledge Transfer Enhanced Graph for Metabolite Production [49.814615043389864]
We propose a new task, Gene-Metabolite Association Prediction based on metabolic graphs.
We present the first benchmark containing 2474 metabolites and 1947 genes of two commonly used microorganisms.
Our proposed methodology outperforms baselines by up to 12.3% across various link prediction frameworks.
arXiv Detail & Related papers (2024-10-24T06:54:27Z)
- Meta-Learning on Augmented Gene Expression Profiles for Enhanced Lung Cancer Detection [3.7929238927240685]
We present a meta-learning-based approach for predicting lung cancer from gene expression profiles.
We employ four distinct datasets for the meta-learning tasks, with one serving as the target dataset and the rest as source datasets.
Results show the superior performance of meta-learning on augmented source data compared to the baselines trained on single datasets.
arXiv Detail & Related papers (2024-08-19T01:39:12Z)
- MMIL: A novel algorithm for disease associated cell type discovery [58.044870442206914]
Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease.
We introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers.
arXiv Detail & Related papers (2024-06-12T15:22:56Z)
- Integrate Any Omics: Towards genome-wide data integration for patient stratification [6.893309898200498]
IntegrAO is an unsupervised framework for integrating incomplete multi-omics data and classifying new samples.
IntegrAO's ability to handle heterogeneous and incomplete data makes it an essential tool for precision oncology.
arXiv Detail & Related papers (2024-01-15T19:57:07Z)
- Optimal transport for automatic alignment of untargeted metabolomic data [8.692678207022084]
We introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport.
By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness.
We show how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
arXiv Detail & Related papers (2023-06-05T20:08:19Z)
- Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data [0.8029049649310213]
We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG).
fiBAG allows simultaneous identification of upstream functional evidence of proteogenomic biomarkers.
We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types.
arXiv Detail & Related papers (2022-12-29T03:31:45Z)
- RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data [59.4141628321618]
We propose an ensemble learning algorithm based on conjunctions or disjunctions of decision rules.
The interpretability of the models makes them useful for biomarker discovery and pattern discovery in high-dimensional data.
arXiv Detail & Related papers (2022-08-11T13:55:04Z)
- A robust kernel machine regression towards biomarker selection in multi-omics datasets of osteoporosis for drug discovery [2.2897244874280043]
We propose "robust kernel machine regression (RobMR)," to improve the robustness of statistical machine regression and the diversity of fictional data.
Experiments demonstrate that the proposed approach effectively identifies the inter-related risk factors of osteoporosis.
The proposed approach can be applied to any disease model for which multi-omics datasets are available.
arXiv Detail & Related papers (2022-01-13T16:39:46Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers.
We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)
- Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data [94.37521840642141]
We suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values.
The methodology is based on the application of elastic principal graphs, which can simultaneously address the tasks of dimensionality reduction, data visualization, clustering, feature selection, and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations.
arXiv Detail & Related papers (2020-07-07T21:04:55Z)