Related papers: SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression

URL: http://arxiv.org/abs/2012.08671v2
Date: Thu, 17 Dec 2020 05:58:21 GMT
Title: SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression
Authors: Wei Cheng, Ghulam Murtaza, Aaron Wang
Abstract summary: We present SimpleChrome, a deep learning model that learns the histone modification representations of genes. The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
Score: 8.326669256957352
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Due to recent breakthroughs in state-of-the-art DNA sequencing technology, genomics data sets have become ubiquitous. The emergence of large-scale data sets provides great opportunities for a better understanding of genomics, especially gene regulation. Although each cell in the human body contains the same set of DNA information, gene expression controls the functions of these cells by turning genes on or off, which is reflected in gene expression levels. Two important factors control the expression level of each gene: (1) gene regulation, such as histone modifications, can directly regulate gene expression; (2) neighboring genes that are functionally related to or interact with each other can also affect the gene expression level. Previous efforts have tried to address the former using attention-based models. However, addressing the latter requires incorporating all potentially related gene information into the model. Though modern machine learning and deep learning models have been able to capture gene expression signals when applied to moderately sized data, they have struggled to recover the underlying signals because of the data's higher dimensionality. To remedy this issue, we present SimpleChrome, a deep learning model that learns the latent histone modification representations of genes. The features learned from the model allow us to better understand the combinatorial effects of cross-gene interactions and direct gene regulation on the target gene expression. The results of this paper show outstanding improvements in the predictive capabilities of downstream models and greatly relax the need for a large data set to learn a robust, generalized neural network. These results have immediate downstream effects in epigenomics research and drug development.

Related papers

GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
Learning to Discover Regulatory Elements for Gene Expression Prediction [59.470991831978516]
Seq2Exp is a Sequence to Expression network designed to discover and extract regulatory elements that drive target gene expression. Our approach captures the causal relationship between epigenomic signals, DNA sequences and their associated regulatory elements.
arXiv Detail & Related papers (2025-02-19T03:25:49Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
Cross-Attention Graph Neural Networks for Inferring Gene Regulatory Networks with Skewed Degree Distribution [9.919024883502322]
We present the Cross-Attention Complex Dual Graph Embedding Model (XATGRN). Our model consistently outperforms existing state-of-the-art methods across various datasets.
arXiv Detail & Related papers (2024-12-18T10:56:40Z)
GeneQuery: A General QA-based Framework for Spatial Gene Expression Predictions from Histology Images [41.732831871866516]
Whole-slide hematoxylin and eosin stained histological images are readily accessible and allow for detailed examinations of tissue structure and composition at the microscopic level. Recent advancements have utilized these histological images to predict spatially resolved gene expression profiles. GeneQuery aims to solve this gene expression prediction task in a question-answering (QA) manner for better generality and flexibility.
arXiv Detail & Related papers (2024-11-27T14:33:13Z)
Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer (BPGT) to improve genetic mutation prediction performance. BPGT first establishes a novel gene encoder that constructs gene priors using two carefully designed modules. BPGT then designs a label decoder that performs genetic mutation prediction using two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes. Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues. Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
A Comparative Analysis of Gene Expression Profiling by Statistical and Machine Learning Approaches [1.8954222800767324]
We discuss the biological and the methodological limitations of machine learning models to classify cancer samples. Gene rankings are obtained from explainability methods adapted to these models. We observe that the information learned by black-box neural networks is related to the notion of differential expression.
arXiv Detail & Related papers (2024-02-01T18:17:36Z)
MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data [22.938437500266847]
We introduce a novel model called Multimodal Similarity Learning Graph Neural Network. It combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data. Our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution.
arXiv Detail & Related papers (2023-09-29T13:33:53Z)
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
SemanticCAP: Chromatin Accessibility Prediction Enhanced by Features Learning from a Language Model [3.0643865202019698]
We propose a new solution named SemanticCAP to identify accessible regions of the genome. It introduces a gene language model which models the context of gene sequences, thus being able to provide an effective representation of gene sequences. Compared with other systems under public benchmarks, our model proved to have better performance.
arXiv Detail & Related papers (2022-04-05T11:47:58Z)
VEGN: Variant Effect Prediction with Graph Neural Networks [19.59965282985234]
We propose VEGN, which models variant effect prediction using a graph neural network (GNN) that operates on a heterogeneous graph with genes and variants. The graph is created by assigning variants to genes and connecting genes with a gene-gene interaction network. VEGN improves the performance of existing state-of-the-art models.
arXiv Detail & Related papers (2021-06-25T13:51:46Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta-learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
Handling highly correlated genes in prediction analysis of genomic studies [0.0]
High correlation among genes introduces technical problems, such as multi-collinearity issues, leading to unreliable prediction models. We propose a grouping algorithm, which treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection. Our proposed grouping method has two advantages. First, using the gene group's common patterns makes the prediction more robust and reliable under condition change.
arXiv Detail & Related papers (2020-07-05T22:14:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.