Related papers: Optimal transport for automatic alignment of untargeted metabolomic data

URL: http://arxiv.org/abs/2306.03218v4
Date: Fri, 24 May 2024 13:16:49 GMT
Title: Optimal transport for automatic alignment of untargeted metabolomic data
Authors: Marie Breeur, George Stepaniants, Pekka Keski-Rahkonen, Philippe Rigollet, Vivian Viallon
Abstract summary: We introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness. We show how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
Score: 8.692678207022084
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. This algorithm scales to thousands of features requiring minimal hyperparameter tuning. Manually curated datasets for validating alignment algorithms are limited in the field of untargeted metabolomics, and hence we develop a dataset split procedure to generate pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.
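The dataset split procedure mentioned in the abstract can be illustrated with a minimal sketch. The scheme below is an assumption made for illustration, not the authors' exact procedure: samples are divided into two disjoint halves, a fraction of the features is kept in both halves, and the remaining features are divided between them. The function name `split_dataset` and the parameters `overlap` and `seed` are hypothetical.

```python
import numpy as np


def split_dataset(X, overlap=0.5, seed=0):
    """Split a (samples x features) matrix into two validation datasets.

    Assumed scheme (illustrative, not taken from the paper): disjoint sample
    halves, a fraction `overlap` of features shared by both halves, and the
    remaining features divided between the two halves.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape

    # Disjoint sample halves.
    sample_perm = rng.permutation(n_samples)
    rows1 = sample_perm[: n_samples // 2]
    rows2 = sample_perm[n_samples // 2:]

    # Shared features plus features unique to each half.
    feat_perm = rng.permutation(n_features)
    n_shared = int(round(overlap * n_features))
    shared = feat_perm[:n_shared]
    rest = feat_perm[n_shared:]
    only1 = rest[: len(rest) // 2]
    only2 = rest[len(rest) // 2:]

    # Shuffle column order independently so position carries no information.
    cols1 = rng.permutation(np.concatenate([shared, only1]))
    cols2 = rng.permutation(np.concatenate([shared, only2]))

    # Ground-truth matching as (column in dataset 1, column in dataset 2).
    pos2 = {int(f): j for j, f in enumerate(cols2)}
    truth = [(i, pos2[int(f)]) for i, f in enumerate(cols1) if int(f) in pos2]

    return X[np.ix_(rows1, cols1)], X[np.ix_(rows2, cols2)], truth


# Toy usage: 40 samples, 10 features, 60% of features shared.
X = np.random.default_rng(1).normal(size=(40, 10))
X1, X2, truth = split_dataset(X, overlap=0.6)
```

Because the true feature correspondence is known by construction, an alignment algorithm's output on `X1` and `X2` can be scored against `truth`, for example with precision and recall.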

Related papers

Interpretable Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification using Multi-Omics Data [36.92842246372894]
Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN) is a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability.
arXiv Detail & Related papers (2025-03-29T02:14:05Z)
Enhanced ECG Arrhythmia Detection Accuracy by Optimizing Divergence-Based Data Fusion [5.575308369829893]
We propose a feature-based fusion algorithm utilizing Kernel Density Estimation (KDE) and Kullback-Leibler (KL) divergence. Using our in-house datasets consisting of ECG signals collected from 2000 healthy and 2000 diseased individuals, we verify our method by using the publicly available PTB-XL dataset. The results demonstrate that the proposed fusion method significantly enhances feature-based classification accuracy for abnormal ECG cases in the merged datasets.
arXiv Detail & Related papers (2025-03-19T12:16:48Z)
Comprehensive Metapath-based Heterogeneous Graph Transformer for Gene-Disease Association Prediction [19.803593399456823]
We propose a COmprehensive MEtapath-based heterogeneous graph Transformer (COMET) for predicting gene-disease associations. Our method demonstrates superior robustness compared to state-of-the-art approaches.
arXiv Detail & Related papers (2025-01-14T09:41:18Z)
Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions. A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z)
MMIL: A novel algorithm for disease associated cell type discovery [58.044870442206914]
Single-cell datasets often lack individual cell labels, making it challenging to identify cells associated with disease. We introduce Mixture Modeling for Multiple Instance Learning (MMIL), an expectation maximization method that enables the training and calibration of cell-level classifiers.
arXiv Detail & Related papers (2024-06-12T15:22:56Z)
FORESEE: Multimodal and Multi-view Representation Learning for Robust Prediction of Cancer Survival [3.4686401890974197]
We propose a new end-to-end framework, FORESEE, for robustly predicting patient survival by mining multimodal information. Cross-fusion transformer effectively utilizes features at the cellular level, tissue level, and tumor heterogeneity level to correlate prognosis. The hybrid attention encoder (HAE) uses the denoising contextual attention module to obtain the contextual relationship features. We also propose an asymmetrically masked triplet masked autoencoder to reconstruct lost information within modalities.
arXiv Detail & Related papers (2024-05-13T12:39:08Z)
SELECTOR: Heterogeneous graph network with convolutional masked autoencoder for multimodal robust prediction of cancer survival [8.403756148610269]
Multimodal prediction of cancer patient survival offers a more comprehensive and precise approach. This paper introduces SELECTOR, a heterogeneous graph-aware network based on convolutional mask encoders. Our method significantly outperforms state-of-the-art methods in both modality-missing and intra-modality information-confirmed cases.
arXiv Detail & Related papers (2024-03-14T11:23:39Z)
Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training [75.40980802817349]
Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. We introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites.
arXiv Detail & Related papers (2023-08-31T00:36:10Z)
Topologically Regularized Multiple Instance Learning to Harness Data Scarcity [15.06687736543614]
Multiple Instance Learning models have emerged as a powerful tool to classify patients' microscopy samples. We introduce a topological regularization term to MIL to mitigate this challenge. We show an average enhancement of 2.8% for MIL benchmarks, 15.3% for synthetic MIL datasets, and 5.5% for real-world biomedical datasets over the current state-of-the-art.
arXiv Detail & Related papers (2023-07-26T08:14:18Z)
Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data [0.8029049649310213]
We propose a framework called Functional Integrative Bayesian Analysis of High-dimensional Multiplatform Genomic Data (fiBAG). fiBAG allows simultaneous identification of upstream functional evidence of proteogenomic biomarkers. We demonstrate the profitability of fiBAG via a pan-cancer analysis of 14 cancer types.
arXiv Detail & Related papers (2022-12-29T03:31:45Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
Lung Cancer Lesion Detection in Histopathology Images Using Graph-Based Sparse PCA Network [93.22587316229954]
We propose a graph-based sparse principal component analysis (GS-PCA) network for automated detection of cancerous lesions on histological lung slides stained by hematoxylin and eosin (H&E). We evaluate the performance of the proposed algorithm on H&E slides obtained from an SVM K-rasG12D lung cancer mouse model using precision/recall rates, F-score, Tanimoto coefficient, and area under the curve (AUC) of the receiver operator characteristic (ROC).
arXiv Detail & Related papers (2021-10-27T19:28:36Z)
Data-Driven Logistic Regression Ensembles With Applications in Genomics [0.0]
We propose a new approach for dealing with high-dimensional binary classification problems that combines ideas from regularization and ensembling. We demonstrate the good performance of our method in terms of prediction accuracy and identification of key biomarkers using several medical datasets involving common diseases such as cancer, multiple sclerosis and psoriasis.
arXiv Detail & Related papers (2021-02-17T05:57:26Z)
G-MIND: An End-to-End Multimodal Imaging-Genetics Framework for Biomarker Identification and Disease Classification [49.53651166356737]
We propose a novel deep neural network architecture to integrate imaging and genetics data, as guided by diagnosis, that provides interpretable biomarkers. We have evaluated our model on a population study of schizophrenia that includes two functional MRI (fMRI) paradigms and Single Nucleotide Polymorphism (SNP) data.
arXiv Detail & Related papers (2021-01-27T19:28:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.