Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling
- URL: http://arxiv.org/abs/2507.15087v1
- Date: Sun, 20 Jul 2025 19:02:07 GMT
- Title: Evaluation of Coding Schemes for Transformer-based Gene Sequence Modeling
- Authors: Chenlei Gong, Yuanhe Tian, Lei Mao, Yan Song
- Abstract summary: We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods: sinusoidal, ALiBi, and RoPE. BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
- Score: 16.581099175248056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Currently, many studies view DNA sequences as a special type of language and use Transformers to model them. These studies use fixed-length k-mer segmentation or BPE subword tokenization but lack a systematic evaluation to determine which is superior. We compare k-mer segmentation with k=1,3,4,5,6, a 4,096-token BPE vocabulary, and three positional encoding methods (sinusoidal, ALiBi, and RoPE). Each configuration is trained from scratch in 3-, 6-, 12-, and 24-layer Transformer encoders and evaluated on the GUE benchmark. In general, BPE delivers higher and more stable performance across tasks by compressing frequent motifs into variable-length tokens, reducing sequence length, and improving model generalization. RoPE excels at capturing periodic motifs and extrapolating to long sequences, while ALiBi also performs well on tasks driven by local dependencies. In terms of depth, we observe significant gains when increasing layers from 3 to 12, with only marginal improvements or slight overfitting at 24 layers. This study provides practical guidance for designing tokenization and positional encoding in DNA Transformer models.
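The sketch below makes the compared options concrete. It is an illustrative reconstruction, not the authors' code: the function names, the k-mer stride, the special-token list, the use of the HuggingFace tokenizers library for the 4,096-token BPE vocabulary, and the symmetric (bidirectional) form of the ALiBi bias are all assumptions.

```python
# Minimal sketches of k-mer segmentation, BPE vocabulary training, ALiBi, and RoPE
# as compared in the abstract above. Illustrative only; details are assumptions.
import numpy as np


def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list:
    """Fixed-length k-mer segmentation (stride=1 gives overlapping k-mers)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def train_dna_bpe(sequences, vocab_size: int = 4096):
    """Train a BPE vocabulary directly on raw DNA strings (HuggingFace `tokenizers`)."""
    from tokenizers import Tokenizer, models, trainers
    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],  # assumed specials
    )
    tokenizer.train_from_iterator(sequences, trainer)
    return tokenizer


def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """ALiBi: head-specific linear penalty on query-key distance, added to attention logits.

    The original ALiBi is causal; the symmetric |i - j| distance here is an assumed
    adaptation for bidirectional encoders.
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)   # one slope per head
    dist = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return -slopes[:, None, None] * dist[None, :, :]                   # (heads, len, len)


def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """RoPE: rotate consecutive dimension pairs of a (seq_len, dim) array by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)              # theta_i = base^(-2i/dim)
    angles = np.arange(seq_len)[:, None] * freqs[None, :]  # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out


if __name__ == "__main__":
    print(kmer_tokenize("ACGTACGTTGCA", k=6))        # ['ACGTAC', 'CGTACG', ...]
    print(alibi_bias(num_heads=4, seq_len=5).shape)  # (4, 5, 5)
    print(rope(np.random.randn(8, 16)).shape)        # (8, 16)
```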
Related papers
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices. (A minimal numerical sketch of accumulated Householder products appears after this list.)
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
- Regulatory DNA sequence Design with Reinforcement Learning [56.20290878358356]
We propose a generative approach that leverages reinforcement learning to fine-tune a pre-trained autoregressive model. We evaluate our method on promoter design tasks in two yeast media conditions and enhancer design tasks for three human cell types.
arXiv Detail & Related papers (2025-03-11T02:33:33Z)
- Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes. We introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z)
- Exploring the Role of Token in Transformer-based Time Series Forecasting [10.081240480138487]
Transformer-based methods are a mainstream approach for time series forecasting (TSF).
Most work focuses on optimizing the model structure, and few studies pay attention to the role of tokens in prediction.
We find that the gradients mainly depend on tokens that contribute to the predicted series, called positive tokens.
To utilize T-PE and V-PE, we propose T2B-PE, a Transformer-based dual-branch framework.
arXiv Detail & Related papers (2024-04-16T07:21:39Z)
- Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision [26.107996342704915]
This paper presents the Ensemble Nucleotide Byte-level-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture.
We use Masked Language Modeling to pre-train the foundation model on reference genome sequences and apply it to a range of downstream tasks.
In each of these tasks, we demonstrate significant improvements over the existing state-of-the-art results.
arXiv Detail & Related papers (2023-11-04T06:00:56Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation [107.2752114891855]
The Transformer structure, built by stacking encoder and decoder layers, has achieved significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words.
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
- Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection [23.39962989492527]
Transformer-based language models such as BERT have achieved state-of-the-art results on various NLP tasks but are computationally prohibitive.
We present Pyramid-BERT, which replaces previously used heuristics with a core-set based token selection method justified by theoretical results.
The core-set based token selection technique avoids expensive pre-training, enables space-efficient fine-tuning, and thus makes the model suitable for handling longer sequence lengths.
arXiv Detail & Related papers (2022-03-27T19:52:01Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
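As referenced in the PaTH entry above, the following is a minimal numerical sketch of position encoding via accumulated Householder products. The random unit reflection vectors stand in for the paper's data-dependent parameterization, and the sequential loop is only a reference implementation, not the paper's parallel algorithm.

```python
# Minimal sketch of accumulated Householder transformations as position encodings
# (see the PaTH entry above). Vector construction and the sequential loop are
# illustrative assumptions, not the paper's actual parameterization or algorithm.
import numpy as np


def householder(v: np.ndarray) -> np.ndarray:
    """H = I - 2 v v^T / (v^T v): reflection across the hyperplane orthogonal to v."""
    v = v / np.linalg.norm(v)
    return np.eye(v.size) - 2.0 * np.outer(v, v)


def accumulated_transforms(vs: np.ndarray) -> np.ndarray:
    """Cumulative products P_t = H_1 H_2 ... H_t, one orthogonal matrix per position."""
    P, out = np.eye(vs.shape[1]), []
    for v in vs:
        P = P @ householder(v)
        out.append(P.copy())
    return np.stack(out)  # (seq_len, dim, dim)


if __name__ == "__main__":
    vs = np.random.randn(6, 4)                    # one vector per token (data-dependent in PaTH)
    P = accumulated_transforms(vs)
    print(np.allclose(P[3] @ P[3].T, np.eye(4)))  # each P_t stays orthogonal: True
    # Relative displacement between positions i < j can then enter attention through
    # the product P[i].T @ P[j] = H_{i+1} ... H_j (a RoPE-like use of orthogonal maps).
```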