Neural network facilitated ab initio derivation of linear formula: A
case study on formulating the relationship between DNA motifs and gene
expression
- URL: http://arxiv.org/abs/2208.09559v1
- Date: Fri, 19 Aug 2022 22:29:30 GMT
- Authors: Chengyu Liu, Wei Wang
- Abstract summary: We propose a framework for ab initio derivation of sequence motifs and a linear formula, using a new approach based on an interpretable neural network model.
We showed that this linear model could predict gene expression levels using promoter sequences with a performance comparable to deep neural network models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing models with high interpretability and even deriving formulas to
quantify relationships between biological data is an emerging need. We propose
here a framework for ab initio derivation of sequence motifs and linear formula
using a new approach based on the interpretable neural network model called
contextual regression model. We showed that this linear model could predict
gene expression levels using promoter sequences with a performance comparable
to deep neural network models. We uncovered a list of 300 motifs with important
regulatory roles on gene expression and showed that they also had significant
contributions to cell-type specific gene expression in 154 diverse cell types.
This work illustrates the possibility of deriving formulas to represent biological
laws that may not be easily elucidated. Code is available at
https://github.com/Wang-lab-UCSD/Motif_Finding_Contextual_Regression
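To make the idea of a linear formula over motifs concrete, here is a minimal sketch. It is not the authors' contextual regression implementation: the scoring rule (best window match against a position weight matrix), the toy motifs `TATA_PWM` and `GC_PWM`, the weights, and the bias are all hypothetical values chosen for illustration.

```python
# Illustrative sketch only. The motifs, weights, and bias below are
# hypothetical and are NOT taken from the paper or its repository.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as rows of 4 values in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def motif_score(seq, pwm):
    """Return the best match score of a position weight matrix along seq."""
    encoded = one_hot(seq)
    width = len(pwm)
    best = float("-inf")
    for start in range(len(seq) - width + 1):
        score = sum(
            pwm[i][j] * encoded[start + i][j]
            for i in range(width)
            for j in range(4)
        )
        best = max(best, score)
    return best


def predict_expression(seq, pwms, weights, bias):
    """Linear formula: bias + sum over motifs of weight * motif score."""
    return bias + sum(w * motif_score(seq, pwm) for w, pwm in zip(weights, pwms))


# Toy matrices: one row per motif position, columns in A, C, G, T order.
TATA_PWM = [[0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]  # "TATA"
GC_PWM = [[0, 0, 1, 0], [0, 1, 0, 0]]  # "GC"

promoter = "GGTATACC"  # made-up promoter fragment
prediction = predict_expression(promoter, [TATA_PWM, GC_PWM], [0.5, -0.25], 1.0)
print(prediction)
```

Because the prediction is a weighted sum, each motif's contribution can be read directly from its weight, which is the interpretability property the abstract emphasizes.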
Related papers
- Long-range gene expression prediction with token alignment of large language model [37.10820914895689]
We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens.
GTA learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts.
GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model.
arXiv Detail & Related papers (2024-10-02T02:42:29Z)
- Generating Multi-Modal and Multi-Attribute Single-Cell Counts with CFGen [76.02070962797794]
We present Cell Flow for Generation, a flow-based conditional generative model for multi-modal single-cell counts.
Our results suggest improved recovery of crucial biological data characteristics while accounting for novel generative tasks.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- A Comparative Analysis of Gene Expression Profiling by Statistical and Machine Learning Approaches [1.8954222800767324]
We discuss the biological and the methodological limitations of machine learning models to classify cancer samples.
Gene rankings are obtained from explainability methods adapted to these models.
We observe that the information learned by black-box neural networks is related to the notion of differential expression.
arXiv Detail & Related papers (2024-02-01T18:17:36Z)
- MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data [22.938437500266847]
We introduce a novel model called Multimodal Similarity Learning Graph Neural Network.
It combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data.
Our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.
arXiv Detail & Related papers (2023-09-29T13:33:53Z)
- Unsupervised ensemble-based phenotyping helps enhance the discoverability of genes related to heart morphology [57.25098075813054]
We propose a new framework for gene discovery entitled Unsupervised Phenotype Ensembles.
It builds a redundant yet highly expressive representation by pooling a set of phenotypes learned in an unsupervised manner.
These phenotypes are then analyzed via genome-wide association studies (GWAS), retaining only highly confident and stable associations.
arXiv Detail & Related papers (2023-01-07T18:36:44Z)
- A single-cell gene expression language model [2.9112649816695213]
We propose a machine learning system to learn context dependencies between genes.
Our model, Exceiver, is trained across a diversity of cell types using a self-supervised task.
We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations.
arXiv Detail & Related papers (2022-10-25T20:52:19Z)
- Graph neural networks for the prediction of molecular structure-property relationships [59.11160990637615]
Graph neural networks (GNNs) are a novel machine learning method that works directly on the molecular graph.
GNNs allow properties to be learned in an end-to-end fashion, thereby avoiding the need for informative descriptors.
We describe the fundamentals of GNNs and demonstrate the application of GNNs via two examples for molecular property prediction.
arXiv Detail & Related papers (2022-07-25T11:30:44Z)
- rfPhen2Gen: A machine learning based association study of brain imaging phenotypes to genotypes [71.1144397510333]
We trained machine learning models to predict SNPs using 56 brain imaging quantitative traits (QTs).
SNPs within the known Alzheimer disease (AD) risk gene APOE had the lowest RMSE for lasso and random forest.
Random forests identified additional SNPs that were not prioritized by the linear models but are known to be associated with brain-related disorders.
arXiv Detail & Related papers (2022-03-31T20:15:22Z)
- Self-Supervised Graph Representation Learning for Neuronal Morphologies [75.38832711445421]
We present GraphDINO, a data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled datasets.
We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings on par with manual feature-based classification by experts.
Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets.
arXiv Detail & Related papers (2021-12-23T12:17:47Z)