Related papers: UNADON: Transformer-based model to predict genome-wide chromosome spatial position

URL: http://arxiv.org/abs/2304.13230v2
Date: Sat, 1 Jul 2023 05:29:14 GMT
Title: UNADON: Transformer-based model to predict genome-wide chromosome spatial position
Authors: Muyu Yang and Jian Ma
Abstract summary: We develop a new transformer-based deep learning model called UNADON. It predicts the genome-wide cytological distance to a specific type of nuclear body. It reveals potential sequence and epigenomic factors that affect large-scale compartmentalization to nuclear bodies.
Score: 2.3980064191633232
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The spatial positioning of chromosomes relative to functional nuclear bodies is intertwined with genome functions such as transcription. However, the sequence patterns and epigenomic features that collectively influence chromatin spatial positioning in a genome-wide manner are not well understood. Here, we develop a new transformer-based deep learning model called UNADON, which predicts the genome-wide cytological distance to a specific type of nuclear body, as measured by TSA-seq, using both sequence features and epigenomic signals. Evaluations of UNADON in four cell lines (K562, H1, HFFc6, HCT116) show high accuracy in predicting chromatin spatial positioning to nuclear bodies when trained on a single cell line. UNADON also performed well in an unseen cell type. Importantly, we reveal potential sequence and epigenomic factors that affect large-scale chromatin compartmentalization to nuclear bodies. Together, UNADON provides new insights into the principles between sequence features and large-scale chromatin spatial localization, which has important implications for understanding nuclear structure and function.

Related papers

HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data [13.66950862644406]
We introduce HEIST, a hierarchical graph transformer-based model for spatial transcriptomics data. HEIST is pre-trained on 22.3M cells from 124 tissues across 15 organs. It effectively encodes the microenvironmental influences in cell embeddings, enabling the discovery of spatially-informed subpopulations.
arXiv Detail & Related papers (2025-06-11T12:29:01Z)
Intermediate State Formation of Topologically Associated Chromatin Domains using Quantum Annealing [0.0]
Topologically Associating Chromatin Domains are spatially distinct regions that regulate transcription by segregating genomic elements. Recent models represent a spin system, where nucleosomes are treated as discrete-state variables. We present a quantum annealing (QA) approach to efficiently sample states, embedding an epigenetic Ising model into the topology of D-Wave quantum processors.
arXiv Detail & Related papers (2025-05-29T09:40:39Z)
GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity [3.972930262155919]
We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
arXiv Detail & Related papers (2024-05-09T09:34:51Z)
Machine and deep learning methods for predicting 3D genome organization [0.0]
Three-Dimensional (3D) enhancer interactions play critical roles in a wide range of cellular processes by regulating gene expression. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons.
arXiv Detail & Related papers (2024-03-04T19:04:41Z)
A metric embedding kernel for live cell microscopy signaling patterns [0.1547863211792184]
We present a metric kernel function for patterns of cell signaling dynamics captured in 5-D live cell microscopy movies. The approach uses Kolmogorov complexity theory to compute a metric distance between movies and to measure the meaningful information. Results are presented quantifying the impact of ERK and AKT signaling between different oncogenic mutations.
arXiv Detail & Related papers (2024-01-04T19:25:00Z)
Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells. During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation. This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
Analyzing scRNA-seq data by CCP-assisted UMAP and t-SNE [0.0]
Correlated clustering and projection (CCP) was introduced as an effective method for preprocessing scRNA-seq data. CCP is a data-domain approach that does not require matrix diagonalization. By using eight publicly available datasets, we have found that CCP significantly improves UMAP and t-SNE visualization.
arXiv Detail & Related papers (2023-06-23T19:15:43Z)
Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
Granger causal inference on DAGs identifies genomic loci regulating transcription [77.58911272503771]
GrID-Net is a framework based on graph neural networks with lagged message passing for Granger causal inference on DAG-structured systems. Our application is the analysis of single-cell multimodal data to identify genomic loci that mediate the regulation of specific genes.
arXiv Detail & Related papers (2022-10-18T21:15:10Z)
Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces. We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting. This suggests that evolutive pressure acts on a low-dimensional manifold despite the high-dimensionality of sequences' space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z)
Epigenomic language models powered by Cerebras [0.0]
Epigenomic BERT (or EBERT) learns representations based on both DNA sequence and paired epigenetic state inputs. We show EBERT's transfer learning potential by demonstrating strong performance on a cell type-specific transcription factor binding prediction task. Our fine-tuned model exceeds state of the art performance on 4 of 13 evaluation datasets from ENCODE-DREAM benchmarks and earns an overall rank of 3rd on the challenge leaderboard.
arXiv Detail & Related papers (2021-12-14T17:23:42Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.