Supervised Learning and Model Analysis with Compositional Data
- URL: http://arxiv.org/abs/2205.07271v1
- Date: Sun, 15 May 2022 12:33:43 GMT
- Authors: Shimeng Huang, Elisabeth Ailer, Niki Kilbertus, Niklas Pfister
- Abstract summary: KernelBiome is a kernel-based non-parametric regression and classification framework for compositional data.
We demonstrate performance on par with or better than state-of-the-art machine learning methods.
- Score: 4.082799056366927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The compositionality and sparsity of high-throughput sequencing data pose a
challenge for regression and classification. However, in microbiome research in
particular, conditional modeling is an essential tool to investigate
relationships between phenotypes and the microbiome. Existing techniques are
often inadequate: they either rely on extensions of the linear log-contrast
model (which adjusts for compositionality, but is often unable to capture
useful signals), or they are based on black-box machine learning methods (which
may capture useful signals, but ignore compositionality in downstream
analyses).
We propose KernelBiome, a kernel-based nonparametric regression and
classification framework for compositional data. It is tailored to sparse
compositional data and is able to incorporate prior knowledge, such as
phylogenetic structure. KernelBiome captures complex signals, including in the
zero-structure, while automatically adapting model complexity. We demonstrate
predictive performance on par with or better than state-of-the-art machine
learning methods. Additionally, our framework provides two key
advantages: (i) We propose two novel quantities to interpret contributions of
individual components and prove that they consistently estimate average
perturbation effects of the conditional mean, extending the interpretability of
linear log-contrast models to nonparametric models. (ii) We show that the
connection between kernels and distances aids interpretability and provides a
data-driven embedding that can augment further analysis. Finally, we apply the
KernelBiome framework to two public microbiome studies and illustrate the
proposed model analysis. KernelBiome is available as an open-source Python
package at https://github.com/shimenghuang/KernelBiome.
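To make the idea concrete, the following is a minimal sketch of kernel ridge regression on simplex-valued data, the general class of model the abstract describes. It is not the KernelBiome package's actual API: the Hellinger kernel, the Dirichlet-simulated abundances, and the log-contrast-style response are all assumptions chosen for illustration.

```python
import numpy as np

# The Hellinger/Bhattacharyya kernel k(x, y) = sum_j sqrt(x_j * y_j)
# is positive semi-definite on the probability simplex, making it a
# natural (illustrative) choice for compositional inputs.

def hellinger_kernel(X, Y):
    """Gram matrix between rows of X and Y (compositions summing to 1)."""
    return np.sqrt(X) @ np.sqrt(Y).T

def fit_kernel_ridge(X, y, lam=1e-2):
    """Solve (K + lam*I) alpha = y for the dual coefficients."""
    K = hellinger_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new):
    """Evaluate the fitted function at new compositions."""
    return hellinger_kernel(X_new, X_train) @ alpha

# Simulated relative abundances: each row lies on the simplex.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=60)
# A log-contrast-style response, echoing the linear log-contrast model.
y = np.log(X[:, 0] + 0.1) - np.log(X[:, 1] + 0.1)

alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```

The actual framework goes further: kernels tailored to sparse compositions and the zero-structure, phylogenetic prior knowledge, and the perturbation-based interpretation quantities described above, all available in the linked package.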
Related papers
- Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's Disease [13.630617713928197]
Graph neural networks have emerged as promising alternatives to classical statistical and machine learning methods.
This study evaluates various graph representation learning models for case-control classification.
We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions.
arXiv Detail & Related papers (2024-06-20T16:06:39Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Geometric Graph Learning with Extended Atom-Types Features for Protein-Ligand Binding Affinity Prediction [0.17132914341329847]
We upgrade the graph-based learners for the study of protein-ligand interactions by integrating extensive atom types such as SYBYL.
Our approach results in two different methods, namely sybyl-GGL-Score and ecif-GGL-Score.
While both of our models achieve state-of-the-art results, the SYBYL atom-type model sybyl-GGL-Score outperforms other methods by a wide margin in all benchmarks.
arXiv Detail & Related papers (2023-01-15T21:30:21Z)
- Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning [102.9138736545956]
Heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs.
We develop HGCL, a novel and robust heterogeneous graph contrastive learning approach, which introduces two views guided by node attributes and graph topologies, respectively.
In this new approach, we adopt distinct, well-suited fusion mechanisms for attributes and topologies in the two views, which helps mine the relevant information in attributes and topologies separately.
arXiv Detail & Related papers (2022-04-30T12:57:02Z)
- Hybrid Feature- and Similarity-Based Models for Prediction and Interpretation using Large-Scale Observational Data [0.0]
We propose a hybrid feature- and similarity-based model for supervised learning.
The proposed hybrid model is fit by convex optimization with a sparsity-inducing penalty on the kernel portion.
We compared our models to solely feature- and similarity-based approaches using synthetic data and using EHR data to predict risk of loneliness or social isolation.
arXiv Detail & Related papers (2022-04-12T20:37:03Z)
- Interpretable Single-Cell Set Classification with Kernel Mean Embeddings [14.686560033030101]
Kernel Mean Embedding encodes the cellular landscape of each profiled biological sample.
We train a simple linear classifier and achieve state-of-the-art classification accuracy on three flow and mass cytometry datasets.
arXiv Detail & Related papers (2022-01-18T21:40:36Z)
- Learning physically consistent mathematical models from data using group sparsity [2.580765958706854]
In areas like biology, high noise levels, sensor-induced correlations, and strong inter-system variability can render data-driven models nonsensical or physically inconsistent.
We show several applications from systems biology that demonstrate the benefits of enforcing priors in data-driven modeling.
arXiv Detail & Related papers (2020-12-11T14:45:38Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture, which allows us to infuse a deep neural network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
- Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be addressed with latent variable models.
High-dimensionality and non-linearity are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.