DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units
- URL: http://arxiv.org/abs/2505.02206v1
- Date: Sun, 04 May 2025 18:02:28 GMT
- Title: DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units
- Authors: Lei Mao, Yuanhe Tian, Yan Song
- Abstract summary: Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles. We propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences. A Transformer-based G-gram encoder is also proposed; the matched G-grams are fed into it to compute their representations, which are then integrated into the encoder for basic units.
- Score: 18.113659670915474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles such as words and syntax. Recent studies utilize advanced neural networks, ranging from convolutional and recurrent models to Transformer-based models, to capture contextual information of gene sequence, with the primary goal of obtaining effective gene sequence representations and thus enhancing the models' understanding of various running gene samples. However, these approaches often directly apply language modeling techniques to gene sequences and do not fully account for the intrinsic organization of information within them, overlooking how units at different granularities contribute to representation. In this paper, we propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences, including small polymers and G-grams, which are combinations of several contiguous polymers. Specifically, we extract the G-grams from large-scale genomic corpora through an unsupervised approach to construct the G-gram vocabulary, which is used to provide G-grams during the learning process for DNA sequences by dynamically matching them against running gene samples. A Transformer-based G-gram encoder is also proposed; the matched G-grams are fed into it to compute their representations, which are then integrated into the encoder for basic units (E4BU), the component responsible for encoding small units and driving learning and inference. To further enhance the learning process, we propose whole G-gram masking to train DNAZEN, where the model preferentially masks entire G-grams rather than applying an ordinary masking mechanism to individual basic units. Experiments on benchmark datasets demonstrate the effectiveness of DNAZEN on various downstream tasks.
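The abstract describes two mechanisms concretely enough to sketch: dynamic G-gram matching against a pre-built vocabulary, and whole G-gram masking. Below is a minimal, illustrative Python sketch of both; the toy vocabulary, greedy longest-match strategy, and string-level masking are assumptions for exposition, not the authors' implementation (DNAZEN operates on tokenized basic units inside a Transformer).

```python
import random

# Hypothetical G-gram vocabulary; DNAZEN mines its vocabulary from
# large-scale genomic corpora with an unsupervised procedure.
GGRAM_VOCAB = {"ATGCGT", "GGTACC", "TTAACGGA"}
MAX_LEN = max(len(g) for g in GGRAM_VOCAB)

def match_ggrams(seq: str) -> list[tuple[int, int]]:
    """Dynamically match vocabulary G-grams in a running sequence
    (greedy, longest match first); returns (start, end) spans."""
    spans, i = [], 0
    while i < len(seq):
        for j in range(min(len(seq), i + MAX_LEN), i, -1):
            if seq[i:j] in GGRAM_VOCAB:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1  # no G-gram starts here; move to the next base
    return spans

def whole_ggram_mask(seq: str, spans: list[tuple[int, int]], p: float = 0.15) -> str:
    """Mask entire matched G-grams rather than individual bases."""
    chars = list(seq)
    for start, end in reversed(spans):  # reverse keeps earlier indices valid
        if random.random() < p:
            chars[start:end] = ["[MASK]"]
    return "".join(chars)

seq = "CCATGCGTAAGGTACCTT"
spans = match_ggrams(seq)                   # [(2, 8), (10, 16)]
print(whole_ggram_mask(seq, spans, p=0.5))  # e.g. "CC[MASK]AA[MASK]TT"
```

Masking spans rather than single bases forces the model to reconstruct a whole multi-polymer unit from its context, which is the stated motivation for whole G-gram masking.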
Related papers
- Masked Language Models are Good Heterogeneous Graph Generalizers [54.08788279971086]
A masked language modeling-based method called LLM4HG.
It uses metapath-based sequences instead of HG tokens to extract structural and semantic information (a minimal sketch follows this entry).
arXiv Detail & Related papers (2025-06-06T15:21:24Z)
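As a rough illustration of metapath-based sequence extraction, the sketch below samples a walk that follows a fixed metapath through a toy heterogeneous graph; the graph, relation names, and metapath are hypothetical, and LLM4HG's actual construction may differ.

```python
import random

# Toy heterogeneous graph stored as (node, relation) -> neighbors.
# Node names, relation names, and the metapath are illustrative only.
HG = {
    ("g1", "in_pathway"): ["p1"],
    ("p1", "contains"): ["g2", "g3"],
    ("g2", "in_pathway"): ["p2"],
    ("p2", "contains"): ["g1"],
}

def metapath_walk(start, metapath, hg):
    """Sample one text-like node sequence by following the metapath's
    relations in order, one hop per relation."""
    sequence, node = [start], start
    for relation in metapath:
        neighbors = hg.get((node, relation), [])
        if not neighbors:
            break
        node = random.choice(neighbors)
        sequence.append(node)
    return sequence

# A gene->pathway->gene walk; such sequences can be fed to a masked LM.
print(metapath_walk("g1", ["in_pathway", "contains"], HG))  # e.g. ['g1', 'p1', 'g3']
```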
- GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRNs) is essential for understanding and predicting the effects of genetic perturbations.
In this work, we leverage a pre-trained large language model and a DNA sequence model to extract features from gene descriptions and DNA sequence data.
We introduce gene biotype information for the first time in genetic perturbation modeling, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
- A Novel Graph Transformer Framework for Gene Regulatory Network Inference [0.27624021966289597]
Inferred gene regulatory networks (GRNs) may not always reflect true biological interactions.
Most GRN inference methods face several challenges in the network reconstruction phase.
We employ autoencoder embeddings to capture gene expression patterns directly from raw data (a minimal sketch follows this entry).
We embed the prior knowledge from GRN structures by transforming them into a text-like representation.
arXiv Detail & Related papers (2025-04-23T06:24:26Z)
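The autoencoder-embedding step lends itself to a short sketch. The PyTorch code below compresses raw expression profiles into per-gene embeddings; all sizes, the architecture, and the random stand-in data are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Sizes and the two-layer architecture are illustrative assumptions.
N_SAMPLES, EMB = 500, 32   # expression values per gene; embedding size

class ExprAutoencoder(nn.Module):
    """Compress raw expression profiles into low-dimensional gene embeddings."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(N_SAMPLES, 128), nn.ReLU(),
                                 nn.Linear(128, EMB))
        self.dec = nn.Sequential(nn.Linear(EMB, 128), nn.ReLU(),
                                 nn.Linear(128, N_SAMPLES))
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

model = ExprAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(1000, N_SAMPLES)          # 1000 genes; random stand-in data
for _ in range(5):                       # a few reconstruction steps
    recon, _ = model(x)
    loss = nn.functional.mse_loss(recon, x)
    opt.zero_grad(); loss.backward(); opt.step()
gene_embeddings = model.enc(x).detach()  # features for downstream GRN inference
```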
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
Trained on an expansive dataset comprising 386B bp of DNA, GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models [59.60208063956459]
Large Language Models (LLMs) require high-quality instruction data for effective alignment.
We present Genetic-Instruct, a scalable algorithm for synthesizing large-scale, high-quality coding instructions.
arXiv Detail & Related papers (2024-07-29T20:42:59Z)
- Gene Regulatory Network Inference from Pre-trained Single-Cell Transcriptomics Transformer with Joint Graph Learning [10.44434676119443]
Inferring gene regulatory networks (GRNs) from single-cell RNA sequencing (scRNA-seq) data is a complex challenge.
In this study, we tackle this challenge by leveraging the single-cell BERT-based pre-trained transformer model (scBERT).
We introduce a novel joint graph learning approach that combines the rich contextual representations learned by single-cell language models with the structured knowledge encoded in GRNs (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-07-25T16:42:08Z)
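As a rough sketch of combining contextual gene embeddings with GRN structure, the code below applies one GCN-style propagation step over a prior adjacency matrix; the random embeddings, toy adjacency, and dot-product edge scores are illustrative stand-ins, not the paper's joint objective.

```python
import numpy as np

# Random stand-ins for per-gene contextual embeddings (e.g., from an
# scBERT-like model) and a small prior GRN adjacency matrix.
n, d = 4, 8
H = np.random.randn(n, d)
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [1, 0, 0, 0]], dtype=float)
A = np.maximum(A, A.T) + np.eye(n)            # symmetrize + self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(1)))
H_fused = D_inv_sqrt @ A @ D_inv_sqrt @ H     # GCN-style smoothing over the GRN
scores = H_fused @ H_fused.T                  # scores for candidate gene pairs
```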
- Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black-box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity (see the sketch after this entry).
arXiv Detail & Related papers (2024-07-03T10:31:30Z)
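A minimal sketch of the evolutionary idea, under stated assumptions: point substitutions as the perturbation operator, GC content as a stand-in for the black-box model, and divergence from the seed's prediction as the fitness. The paper's Genetic Programming setup is richer than this.

```python
import random

BASES = "ACGT"

def mutate(seq: str, rate: float = 0.05) -> str:
    """Apply random point substitutions - the perturbation operator."""
    return "".join(random.choice(BASES) if random.random() < rate else b
                   for b in seq)

def model_score(seq: str) -> float:
    """Stand-in for a black-box genomic model; here, GC content."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def evolve(seed: str, pop: int = 20, gens: int = 10) -> list[str]:
    """Keep perturbations whose predictions diverge most from the seed,
    so the local dataset around `seed` is semantically diverse."""
    base = model_score(seed)
    population = [mutate(seed) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda s: abs(model_score(s) - base), reverse=True)
        parents = population[: pop // 2]
        population = parents + [mutate(p) for p in parents]
    return population

local_dataset = evolve("ATGCGTACGTTAGCATGCGT")
```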
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as a learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings (a minimal sketch follows this entry).
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
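A toy sketch of vector-quantized tokenization: embed fixed-length k-mers and snap each to its nearest codebook entry. The random projection, codebook size, and non-overlapping k-mers are illustrative assumptions; VQDNA learns these components end to end.

```python
import numpy as np

rng = np.random.default_rng(0)
K, DIM, CODES = 4, 16, 8
codebook = rng.normal(size=(CODES, DIM))  # stands in for a learned vocabulary
PROJ = rng.normal(size=(4 * K, DIM))      # fixed stand-in k-mer embedding

def embed_kmer(kmer: str) -> np.ndarray:
    """One-hot encode a k-mer and project it into embedding space."""
    idx = ["ACGT".index(b) for b in kmer]
    onehot = np.zeros((K, 4))
    onehot[np.arange(K), idx] = 1.0
    return onehot.flatten() @ PROJ

def vq_tokenize(seq: str) -> list[int]:
    """Map a genome into discrete codebook IDs, one per k-mer."""
    ids = []
    for i in range(0, len(seq) - K + 1, K):
        z = embed_kmer(seq[i:i + K])
        ids.append(int(np.argmin(np.linalg.norm(codebook - z, axis=1))))
    return ids

print(vq_tokenize("ATGCGGTACCATGGCA"))  # e.g. [3, 1, 6, 3]
```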
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- SGC: A semi-supervised pipeline for gene clustering using self-training approach in gene co-expression networks [3.8073142980733]
We propose a novel pipeline for gene clustering based on the mathematics of spectral network theory.
SGC consists of multiple novel steps that enable the computation of highly enriched modules in an unsupervised manner.
We show that SGC results in higher enrichment on real data (a minimal sketch of the spectral step follows this entry).
arXiv Detail & Related papers (2022-09-21T14:51:08Z)
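The spectral core of such a pipeline can be sketched briefly (requires NumPy and scikit-learn): build a co-expression graph, take the leading eigenvectors of its normalized Laplacian, and cluster them. The correlation threshold, cluster count, and random data are illustrative; SGC's self-training and enrichment-based module selection are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 100))                 # 30 genes x 100 samples
corr = np.corrcoef(expr)
A = (np.abs(corr) > 0.2).astype(float)            # co-expression adjacency
np.fill_diagonal(A, 0)
deg = A.sum(1)
deg[deg == 0] = 1.0                               # guard isolated genes
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalized Laplacian
vals, vecs = np.linalg.eigh(L)                    # ascending eigenvalues
emb = vecs[:, :2]                                 # smallest-eigenvalue vectors
labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb)
```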