Supervised Learning and Model Analysis with Compositional Data
- URL: http://arxiv.org/abs/2205.07271v1
- Date: Sun, 15 May 2022 12:33:43 GMT
- Authors: Shimeng Huang, Elisabeth Ailer, Niki Kilbertus, Niklas Pfister
- Abstract summary: KernelBiome is a kernel-based non-parametric regression and classification framework for compositional data.
We demonstrate performance on par with or better than state-of-the-art machine learning methods.
- Score: 4.082799056366927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The compositionality and sparsity of high-throughput sequencing data pose a
challenge for regression and classification. However, in microbiome research in
particular, conditional modeling is an essential tool to investigate
relationships between phenotypes and the microbiome. Existing techniques are
often inadequate: they either rely on extensions of the linear log-contrast
model (which adjusts for compositionality, but is often unable to capture
useful signals), or they are based on black-box machine learning methods (which
may capture useful signals, but ignore compositionality in downstream
analyses).
We propose KernelBiome, a kernel-based nonparametric regression and
classification framework for compositional data. It is tailored to sparse
compositional data and is able to incorporate prior knowledge, such as
phylogenetic structure. KernelBiome captures complex signals, including in the
zero-structure, while automatically adapting model complexity. We demonstrate
predictive performance on par with or better than state-of-the-art machine
learning methods. Additionally, our framework provides two key
advantages: (i) We propose two novel quantities to interpret contributions of
individual components and prove that they consistently estimate average
perturbation effects of the conditional mean, extending the interpretability of
linear log-contrast models to nonparametric models. (ii) We show that the
connection between kernels and distances aids interpretability and provides a
data-driven embedding that can augment further analysis. Finally, we apply the
KernelBiome framework to two public microbiome studies and illustrate the
proposed model analysis. KernelBiome is available as an open-source Python
package at https://github.com/shimenghuang/KernelBiome.
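To make the idea concrete, the following is a minimal sketch of kernel ridge regression on simplex-valued data, the general class of model the abstract describes. It is not the KernelBiome package's actual API: the Hellinger kernel, the Dirichlet-simulated abundances, and the log-contrast-style response are all assumptions chosen for illustration.

```python
import numpy as np

# The Hellinger/Bhattacharyya kernel k(x, y) = sum_j sqrt(x_j * y_j)
# is positive semi-definite on the probability simplex, making it a
# natural (illustrative) choice for compositional inputs.

def hellinger_kernel(X, Y):
    """Gram matrix between rows of X and Y (compositions summing to 1)."""
    return np.sqrt(X) @ np.sqrt(Y).T

def fit_kernel_ridge(X, y, lam=1e-2):
    """Solve (K + lam*I) alpha = y for the dual coefficients."""
    K = hellinger_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new):
    """Evaluate the fitted function at new compositions."""
    return hellinger_kernel(X_new, X_train) @ alpha

# Simulated relative abundances: each row lies on the simplex.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(5), size=60)
# A log-contrast-style response, echoing the linear log-contrast model.
y = np.log(X[:, 0] + 0.1) - np.log(X[:, 1] + 0.1)

alpha = fit_kernel_ridge(X, y)
y_hat = predict(X, alpha, X)
```

The actual framework goes further: kernels tailored to sparse compositions and the zero-structure, phylogenetic prior knowledge, and the perturbation-based interpretation quantities described above, all available in the linked package.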
Related papers
- Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's Disease [13.630617713928197]
Graph neural networks have emerged as promising alternatives to classical statistical and machine learning methods.
This study evaluates various graph representation learning models for case-control classification.
We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions.
arXiv Detail & Related papers (2024-06-20T16:06:39Z)
- Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling [53.7117640028211]
We present a geometry-constrained probabilistic modeling treatment to resolve the identified issues.
We incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space.
A spectral graph-theoretic method is devised to estimate the number of potential novel classes.
arXiv Detail & Related papers (2024-03-02T00:56:05Z)
- Geometric Graph Learning with Extended Atom-Types Features for Protein-Ligand Binding Affinity Prediction [0.17132914341329847]
We upgrade the graph-based learners for the study of protein-ligand interactions by integrating extensive atom types such as SYBYL.
Our approach results in two different methods, namely sybyl-GGL-Score and ecif-GGL-Score.
While both of our models achieve state-of-the-art results, the SYBYL atom-type model sybyl-GGL-Score outperforms other methods by a wide margin in all benchmarks.
arXiv Detail & Related papers (2023-01-15T21:30:21Z)
- Heterogeneous Graph Neural Networks using Self-supervised Reciprocally Contrastive Learning [102.9138736545956]
Heterogeneous graph neural network (HGNN) is a very popular technique for the modeling and analysis of heterogeneous graphs.
We develop HGCL, a novel and robust heterogeneous graph contrastive learning approach, which introduces two views guided by node attributes and graph topologies, respectively.
In this new approach, we adopt distinct, well-suited fusion mechanisms for attributes and topologies in the two views, which helps mine the relevant information in attributes and topologies separately.
arXiv Detail & Related papers (2022-04-30T12:57:02Z)
- Hybrid Feature- and Similarity-Based Models for Prediction and Interpretation using Large-Scale Observational Data [0.0]
We propose a hybrid feature- and similarity-based model for supervised learning.
The proposed hybrid model is fit by convex optimization with a sparsity-inducing penalty on the kernel portion.
We compared our models to solely feature- and similarity-based approaches using synthetic data and using EHR data to predict risk of loneliness or social isolation.
arXiv Detail & Related papers (2022-04-12T20:37:03Z)
- Interpretable Single-Cell Set Classification with Kernel Mean Embeddings [14.686560033030101]
Kernel Mean Embedding encodes the cellular landscape of each profiled biological sample.
We train a simple linear classifier and achieve state-of-the-art classification accuracy on three flow and mass cytometry datasets.
arXiv Detail & Related papers (2022-01-18T21:40:36Z)
- Learning physically consistent mathematical models from data using group sparsity [2.580765958706854]
In areas like biology, high noise levels, sensor-induced correlations, and strong inter-system variability can render data-driven models nonsensical or physically inconsistent.
We show several applications from systems biology that demonstrate the benefits of enforcing priors in data-driven modeling.
arXiv Detail & Related papers (2020-12-11T14:45:38Z)
- Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture, which allows us to infuse a deep neural network with human-powered abstraction on the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
- Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be addressed with latent variable models.
High-dimensionality and non-linearity are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.