Fugu-MT Paper Translation (Abstract): EvoLen: Evolution-Guided Tokenization for DNA Language Model

Paper abstract: EvoLen: Evolution-Guided Tokenization for DNA Language Model

arxiv url: http://arxiv.org/abs/2604.08698v1
Date: Thu, 09 Apr 2026 18:41:28 GMT
Status: Translation complete
System update date: 2026-04-13 17:57:53.536847
Title: EvoLen: Evolution-Guided Tokenization for DNA Language Model
Title (reference translation): EvoLen: Evolution-Guided Tokenization for DNA Language Models
Authors: Nan Huang, Xiaoxiao Zhou, Junxia Cui, Mario Tapia-Pacheco, Tiffany Amariuta, Yang Li, Jingbo Shang
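Abstract summary: EvoLen is a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. The results show that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
Reference score (independently computed attention metric): 37.47818233836275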
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and evolutionary constraint rather than linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns like regulatory motifs: short, recurring segments that are under evolutionary constraint and typically preserved across species. We incorporate evolutionary information directly into the tokenization process through EvoLen, a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen uses cross-species evolutionary signals to group DNA sequences, trains separate BPE tokenizers on each group, merges the resulting vocabularies via a rule prioritizing preserved patterns, and applies length-aware decoding with dynamic programming. Through controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results demonstrate that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.
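Abstract (reference translation): Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA has no inherent token boundaries or predefined compositional rules, so tokenization is a fundamental modeling decision rather than something naturally specified. Existing approaches such as byte-pair encoding (BPE) are good at capturing token structures that reflect the regularities of human-generated language, but DNA is organized by biological function and evolutionary constraint rather than by linguistic convention. We argue that DNA tokenization should prioritize functional sequence patterns such as regulatory motifs, which are short, recurring segments under evolutionary constraint that are typically conserved across species. EvoLen is a tokenizer that combines evolutionary stratification with length-aware decoding to better preserve motif-scale functional sequence units. EvoLen groups DNA sequences using cross-species evolutionary signals, trains a separate BPE tokenizer on each group, merges the resulting vocabularies through a rule that prioritizes conserved patterns, and applies length-aware decoding based on dynamic programming. In controlled experiments, EvoLen improves the preservation of functional sequence patterns, differentiation across genomic contexts, and alignment with evolutionary constraint, while matching or outperforming standard BPE across diverse DNALM benchmarks. These results show that tokenization introduces a critical inductive bias and that incorporating evolutionary information yields more biologically meaningful and interpretable sequence representations.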
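
The abstract names two steps that a small example can make concrete: merging per-group vocabularies with priority for conserved patterns, and length-aware decoding via dynamic programming. The sketch below is only an illustration under stated assumptions and is not the authors' implementation. The abstract does not give the merge rule or the decoding objective, so the function names `merge_vocabularies` and `length_aware_decode`, the "conserved tokens first" ordering, and the score `len(token) ** alpha` are all hypothetical choices made here.

```python
from typing import Iterable, List


def merge_vocabularies(conserved: Iterable[str], background: Iterable[str],
                       max_size: int) -> List[str]:
    """Merge two token lists, taking conserved-group tokens first.

    ASSUMPTION: the real merge rule is not given in the abstract; this simply
    orders tokens as single nucleotides, then conserved, then background.
    """
    merged: List[str] = []
    seen = set()
    # Single nucleotides go first so any A/C/G/T sequence stays segmentable.
    for tok in list("ACGT") + list(conserved) + list(background):
        if tok not in seen and len(merged) < max_size:
            seen.add(tok)
            merged.append(tok)
    return merged


def length_aware_decode(seq: str, vocab: Iterable[str],
                        alpha: float = 2.0) -> List[str]:
    """Segment seq into vocabulary tokens, maximizing sum(len(token) ** alpha).

    ASSUMPTION: this objective is a stand-in that rewards long tokens; the
    paper's actual length-aware score may differ.
    """
    vocab_set = set(vocab)
    max_len = max(len(t) for t in vocab_set)
    n = len(seq)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best score for the prefix seq[:i]
    back = [0] * (n + 1)         # back[i]: start index of the last token
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = seq[start:end]
            if piece in vocab_set and best[start] != neg_inf:
                score = best[start] + len(piece) ** alpha
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == neg_inf:
        raise ValueError("sequence cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the tokens.
    tokens: List[str] = []
    pos = n
    while pos > 0:
        start = back[pos]
        tokens.append(seq[start:pos])
        pos = start
    return tokens[::-1]


# Toy usage: "TATAAA" and "GATA" stand in for conserved motif-like tokens.
vocab = merge_vocabularies(["TATAAA", "GATA"], ["TA", "AT", "GC"], max_size=16)
print(length_aware_decode("GCTATAAAGATA", vocab))
# ['GC', 'TATAAA', 'GATA']
```

With `alpha` greater than 1, one long token scores higher than the same span split into shorter pieces, which is why the motif-like tokens are kept whole in the toy output.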

Related papers