DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification
- URL: http://arxiv.org/abs/2509.25274v1
- Date: Sun, 28 Sep 2025 16:10:03 GMT
- Title: DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification
- Authors: Darren King, Yaser Atlasi, Gholamreza Rafiee
- Abstract summary: DNABERT-2 is a transformer genomic language model that uses byte-pair encoding to learn variable-length tokens from DNA. Gene enhancers control when and where genes switch on, yet their sequence diversity and tissue specificity make them hard to pinpoint in colorectal cancer. This is the first study to apply a second-generation genomic language model with BPE tokenization to enhancer classification in colorectal cancer.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gene enhancers control when and where genes switch on, yet their sequence diversity and tissue specificity make them hard to pinpoint in colorectal cancer. We take a sequence-only route and fine-tune DNABERT-2, a transformer genomic language model that uses byte-pair encoding to learn variable-length tokens from DNA. Using assays curated via the Johnston Cancer Research Centre at Queen's University Belfast, we assembled a balanced corpus of 2.34 million 1 kb enhancer sequences, applied summit-centered extraction and rigorous de-duplication including reverse-complement collapse, and split the data stratified by class. With a 4096-term vocabulary and a 232-token context chosen empirically, the DNABERT-2-117M classifier was trained with Optuna-tuned hyperparameters and evaluated on 350742 held-out sequences. The model reached PR-AUC 0.759, ROC-AUC 0.743, and best F1 0.704 at an optimized threshold (0.359), with recall 0.835 and precision 0.609. Against a CNN-based EnhancerNet trained on the same data, DNABERT-2 delivered stronger threshold-independent ranking and higher recall, although point accuracy was lower. To our knowledge, this is the first study to apply a second-generation genomic language model with BPE tokenization to enhancer classification in colorectal cancer, demonstrating the feasibility of capturing tumor-associated regulatory signals directly from DNA sequence alone. Overall, our results show that transformer-based genomic models can move beyond motif-level encodings toward holistic classification of regulatory elements, offering a novel path for cancer genomics. Next steps will focus on improving precision, exploring hybrid CNN-transformer designs, and validating across independent datasets to strengthen real-world utility.
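The preprocessing and metric steps named in the abstract (summit-centered 1 kb extraction, de-duplication with reverse-complement collapse, and the F1 reported at the optimized threshold) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the helper names, the toy sequences, and the choice to clamp windows at coordinate 0 are all assumptions made for illustration. Only the precision (0.609) and recall (0.835) figures come from the abstract.

```python
# Illustrative sketch only: helper names, toy data, and the clamping rule
# are hypothetical and are not taken from the paper.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (case-insensitive input)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def summit_window(summit, width=1000):
    """Half-open [start, end) window of `width` bp centred on a peak summit.

    Assumption: windows that would start before position 0 are shifted right.
    """
    start = max(0, summit - width // 2)
    return start, start + width


def dedupe(seqs):
    """Drop duplicates, treating a sequence and its reverse complement as equal."""
    seen = set()
    kept = []
    for s in seqs:
        key = canonical(s)
        if key not in seen:
            seen.add(key)
            kept.append(s.upper())
    return kept


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy input: the second entry is the reverse complement of the first,
# and the third is a lowercase copy of the first.
toy = ["ACGTTG", "CAACGT", "acgttg", "GGGCCC"]
unique = dedupe(toy)

# Precision and recall as reported in the abstract at threshold 0.359.
reported_f1 = f1_score(0.609, 0.835)
print(unique, summit_window(5000), round(reported_f1, 3))
```

On the toy input, only two sequences survive the collapse, and the harmonic mean of the reported precision and recall rounds to the 0.704 best F1 quoted in the abstract, so the three figures are mutually consistent.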
Related papers
- A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification [0.0]
We propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. We present detailed experimental results, including VAE training curves, performance metrics (ROC curves and confusion matrix), and architecture diagrams.
arXiv Detail & Related papers (2025-08-02T16:57:31Z)
- Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods [0.0]
Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges. We propose merging unique 6-mer tokens with selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns.
arXiv Detail & Related papers (2025-07-24T16:45:23Z)
- scMamba: A Pre-Trained Model for Single-Nucleus RNA Sequencing Analysis in Neurodegenerative Disorders [43.24785083027205]
scMamba is a pre-trained model designed to improve the quality and utility of snRNA-seq analysis. Inspired by the recent Mamba model, scMamba introduces a novel architecture that incorporates a linear adapter layer, gene embeddings, and bidirectional Mamba blocks. We demonstrate that scMamba outperforms benchmark methods in various downstream tasks, including cell type annotation, doublet detection, imputation, and the identification of differentially expressed genes.
arXiv Detail & Related papers (2025-02-12T11:48:22Z)
- Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. We show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining.
arXiv Detail & Related papers (2024-12-18T10:55:43Z)
- Hybrid deep learning-based strategy for the hepatocellular carcinoma cancer grade classification of H&E stained liver histopathology images [2.833640239679924]
Hepatocellular carcinoma (HCC) is a common type of liver cancer whose early-stage diagnosis is a common challenge. We propose a hybrid deep learning-based architecture that uses transfer learning to extract the features from pre-trained convolutional neural network (CNN) models. The proposed hybrid model showed improvement in accuracy of 2% and 4% over the pre-trained models in TCGA-LIHC and KMC databases.
arXiv Detail & Related papers (2024-12-04T07:26:36Z)
- Brain Tumor Classification on MRI in Light of Molecular Markers [56.99710477905796]
Co-deletion of the 1p/19q gene is associated with clinical outcomes in low-grade gliomas. This study aims to utilize a specialized MRI-based convolutional neural network for brain cancer detection.
arXiv Detail & Related papers (2024-09-29T07:04:26Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- xTrimoGene: An Efficient and Scalable Representation Learner for Single-Cell RNA-Seq Data [45.043516102428676]
We propose a novel asymmetric encoder-decoder transformer for scRNA-seq data, called xTrimoGene$^\alpha$ (or xTrimoGene for short).
xTrimoGene reduces FLOPs by one to two orders of magnitude compared to classical transformers while maintaining high accuracy.
Our experiments also show that the performance of xTrimoGene improves as we scale up the model sizes.
arXiv Detail & Related papers (2023-11-26T01:23:01Z)
- Breast Ultrasound Tumor Classification Using a Hybrid Multitask CNN-Transformer Network [63.845552349914186]
Capturing global contextual information plays a critical role in breast ultrasound (BUS) image classification.
Vision Transformers have an improved capability of capturing global contextual information but may distort the local image patterns due to the tokenization operations.
In this study, we proposed a hybrid multitask deep neural network called Hybrid-MT-ESTAN, designed to perform BUS tumor classification and segmentation.
arXiv Detail & Related papers (2023-08-04T01:19:32Z)
- A Hybrid Machine Learning Model for Classifying Gene Mutations in Cancer using LSTM, BiLSTM, CNN, GRU, and GloVe [0.0]
We introduce a novel hybrid ensemble model that synergistically combines LSTM, BiLSTM, CNN, GRU, and GloVe embeddings for the classification of gene mutations in cancer.
Our approach achieved a training accuracy of 80.6%, precision of 81.6%, recall of 80.6%, and an F1 score of 83.1%, alongside a significantly reduced Mean Squared Error (MSE) of 2.596.
arXiv Detail & Related papers (2023-07-24T21:01:46Z)
- HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
- DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models.
We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE).
We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z)