Fugu-MT Paper Translation (Abstract): DNACHUNKER: Learnable Tokenization for DNA Language Models

Paper summary: DNACHUNKER: Learnable Tokenization for DNA Language Models

arxiv url: http://arxiv.org/abs/2601.03019v1
Date: Tue, 06 Jan 2026 13:46:42 GMT
Status: translation complete
Last updated in system: 2026-01-07 17:02:12.958122
Title: DNACHUNKER: Learnable Tokenization for DNA Language Models
Title (reference translation): DNACHUNKER: Learnable Tokenization for DNA Language Models
Authors: Taewon Kim, Jihwan Shin, Hyomin Kim, Youngmok Jung, Jonhoon Lee, Won-Chul Lee, Insu Han, Sungsoo Ahn
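Abstract summary: In this work, we propose DNACHUNKER, which integrates a learnable dynamic DNA tokenization mechanism. We demonstrate the performance of DNACHUNKER by training it on the human reference genome (HG38) and testing it on the Nucleotide Transformer and Genomic benchmarks.
Reference score (independently computed attention score): 27.919576076056146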
License: http://creativecommons.org/licenses/by/4.0/
Abstract: DNA language models have emerged as powerful tools for decoding the complex language of DNA sequences. However, the performance of these models is heavily affected by their tokenization strategy, i.e., a method used to parse DNA sequences into a shorter sequence of chunks. In this work, we propose DNACHUNKER, which integrates a learnable dynamic DNA tokenization mechanism and is trained as a masked language model. Adopting the dynamic chunking procedure proposed by H-Net, our model learns to segment sequences into variable-length chunks. This dynamic chunking offers two key advantages: it's resilient to shifts and mutations in the DNA, and it allocates more detail to important functional areas. We demonstrate the performance of DNACHUNKER by training it on the human reference genome (HG38) and testing it on the Nucleotide Transformer and Genomic benchmarks. Further ablative experiments reveal that DNACHUNKER learns tokenization that grasps biological grammar and uses smaller chunks to preserve detail in important functional elements such as promoters and exons, while using larger chunks for repetitive, redundant regions.
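Abstract (reference translation): DNA language models have emerged as powerful tools for decoding the complex language of DNA sequences. However, the performance of these models is strongly affected by their tokenization strategy, that is, the method used to parse DNA sequences into shorter sequences of chunks. In this work, we propose DNACHUNKER, which integrates a learnable dynamic DNA tokenization mechanism and is trained as a masked language model. Adopting the dynamic chunking procedure proposed by H-Net, the model learns to segment sequences into variable-length chunks. This dynamic chunking has two major advantages: it is resilient to shifts and mutations in the DNA, and it allocates more detail to important functional regions. We demonstrate the performance of DNACHUNKER by training it on the human reference genome (HG38) and testing it on the Nucleotide Transformer and Genomic benchmarks. Furthermore, DNACHUNKER grasps biological grammar, using smaller chunks to preserve detail in important functional elements such as promoters and exons and larger chunks for repetitive, redundant regions.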
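
The abstract's claim that variable-length chunking is resilient to shifts can be illustrated with a toy comparison. The sketch below is not DNACHUNKER's learned mechanism: the "cut after every G" rule is a hypothetical stand-in for a learned boundary predictor, and the function names and example sequence are made up for illustration. It shows only that fixed, position-based k-mer tokenization loses every token after a single-base insertion, while boundaries that depend on local content realign after the edit.

```python
from collections import Counter


def kmer_tokenize(seq: str, k: int = 3) -> list[str]:
    """Fixed tokenization: non-overlapping k-mers (the last one may be shorter)."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def content_chunk(seq: str, boundary: str = "G") -> list[str]:
    """Toy content-defined chunking: close a chunk after each `boundary` base.

    Hypothetical rule standing in for a learned boundary predictor.
    """
    chunks, current = [], ""
    for base in seq:
        current += base
        if base == boundary:
            chunks.append(current)
            current = ""
    if current:  # keep any trailing bases that did not end on a boundary
        chunks.append(current)
    return chunks


def shared_tokens(a: list[str], b: list[str]) -> int:
    """Number of tokens common to both tokenizations (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


original = "ATGGCCAAGTTCGAC"
shifted = "C" + original  # a single-base insertion at the start

fixed_overlap = shared_tokens(kmer_tokenize(original), kmer_tokenize(shifted))
content_overlap = shared_tokens(content_chunk(original), content_chunk(shifted))

print("fixed 3-mers, tokens preserved:", fixed_overlap)
print("content-defined, tokens preserved:", content_overlap)
```

With position-based 3-mers the insertion shifts every later boundary, so no token survives; with the content-based rule only the first chunk changes. A learned chunker would replace the fixed rule with boundaries predicted from the sequence itself.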
