Fugu-MT paper translation (abstract): DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification

Paper overview: DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification

arxiv url: http://arxiv.org/abs/2509.25274v1
Date: Sun, 28 Sep 2025 16:10:03 GMT
Status: translation complete
Last updated in system: 2025-10-01 17:09:04.224562
Title: DNABERT-2: Fine-Tuning a Genomic Language Model for Colorectal Gene Enhancer Classification
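Title (reference translation): DNABERT-2: A Genomic Language Model for Colorectal Cancer Gene Enhancer Classification
Authors: Darren King, Yaser Atlasi, Gholamreza Rafiee
Abstract summary: DNABERT-2 is a transformer genomic language model that uses byte-pair encoding to learn variable-length tokens from DNA. Gene enhancers control when and where genes switch on, but their sequence diversity and tissue specificity make them hard to pinpoint in colorectal cancer. This is the first study to apply a second-generation genomic language model with BPE tokenization to enhancer classification in colorectal cancer.
Reference score (independently computed attention score): 0.0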
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Gene enhancers control when and where genes switch on, yet their sequence diversity and tissue specificity make them hard to pinpoint in colorectal cancer. We take a sequence-only route and fine-tune DNABERT-2, a transformer genomic language model that uses byte-pair encoding to learn variable-length tokens from DNA. Using assays curated via the Johnston Cancer Research Centre at Queen's University Belfast, we assembled a balanced corpus of 2.34 million 1 kb enhancer sequences, applied summit-centered extraction and rigorous de-duplication including reverse-complement collapse, and split the data stratified by class. With a 4096-term vocabulary and a 232-token context chosen empirically, the DNABERT-2-117M classifier was trained with Optuna-tuned hyperparameters and evaluated on 350742 held-out sequences. The model reached PR-AUC 0.759, ROC-AUC 0.743, and best F1 0.704 at an optimized threshold (0.359), with recall 0.835 and precision 0.609. Against a CNN-based EnhancerNet trained on the same data, DNABERT-2 delivered stronger threshold-independent ranking and higher recall, although point accuracy was lower. To our knowledge, this is the first study to apply a second-generation genomic language model with BPE tokenization to enhancer classification in colorectal cancer, demonstrating the feasibility of capturing tumor-associated regulatory signals directly from DNA sequence alone. Overall, our results show that transformer-based genomic models can move beyond motif-level encodings toward holistic classification of regulatory elements, offering a novel path for cancer genomics. Next steps will focus on improving precision, exploring hybrid CNN-transformer designs, and validating across independent datasets to strengthen real-world utility.
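The reported operating point can be sanity-checked with a little arithmetic: F1 is the harmonic mean of precision and recall, and the "optimized threshold" is typically found by sweeping candidate cut-offs over the classifier scores. The sketch below is a minimal pure-Python illustration under that assumption; the function names `f1_score` and `best_f1_threshold` are hypothetical and this is not the authors' evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores, labels):
    """Sweep every distinct score as a threshold and return (threshold, F1)
    for the cut-off with the highest F1. A sequence is predicted positive
    when its score is >= the threshold. Labels are 1 (enhancer) or 0."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = f1_score(precision, recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Check the abstract's figures: precision 0.609 and recall 0.835.
reported_f1 = f1_score(0.609, 0.835)
print(round(reported_f1, 3))
```

With the abstract's precision (0.609) and recall (0.835), `f1_score` evaluates to about 0.704, which agrees with the reported best F1.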
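Abstract (reference translation): Gene enhancers control when and where genes switch on, but their sequence diversity and tissue specificity make them hard to pinpoint in colorectal cancer. We take a sequence-only route using DNABERT-2, a transformer genomic language model that uses byte-pair encoding to learn variable-length tokens from DNA. Using assays curated via the Johnston Cancer Research Centre at Queen's University Belfast, we assembled a balanced corpus of 2.34 million 1 kb enhancer sequences. With a 4096-term vocabulary and a 232-token context chosen empirically, the DNABERT-2-117M classifier was trained with Optuna-tuned hyperparameters and evaluated on 350742 held-out sequences. The model reached PR-AUC 0.759, ROC-AUC 0.743, and a best F1 of 0.704 at an optimized threshold (0.359), with recall 0.835 and precision 0.609. Against a CNN-based EnhancerNet trained on the same data, DNABERT-2 delivered stronger threshold-independent ranking and higher recall, although point accuracy was lower. This is the first study to apply a second-generation genomic language model with BPE tokenization to enhancer classification in colorectal cancer, demonstrating the feasibility of capturing tumor-associated regulatory signals directly from DNA sequence. Overall, the results suggest that transformer-based genomic models can move beyond motif-level encodings toward holistic classification of regulatory elements, offering a new path for cancer genomics. Next steps will focus on improving precision, exploring hybrid CNN-transformer designs, and validating across independent datasets to strengthen real-world utility.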
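The English abstract mentions de-duplication "including reverse-complement collapse". One common way to do this is to map each sequence and its reverse complement to a single canonical key and keep only the first sequence seen per key. The sketch below illustrates that idea; the helper names (`reverse_complement`, `canonical_form`, `collapse_duplicates`) and the choice of the lexicographically smaller strand as the key are assumptions for illustration, not the authors' pipeline.

```python
# Complement table for the DNA alphabet; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, upper-cased."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def canonical_form(seq: str) -> str:
    """Strand-independent key: the lexicographically smaller of the
    sequence and its reverse complement (an assumed convention)."""
    forward = seq.upper()
    return min(forward, reverse_complement(forward))


def collapse_duplicates(seqs):
    """Keep the first occurrence of each sequence, treating a sequence and
    its reverse complement as the same record. Order is preserved and the
    kept sequences are returned upper-cased."""
    seen = set()
    kept = []
    for seq in seqs:
        key = canonical_form(seq)
        if key not in seen:
            seen.add(key)
            kept.append(seq.upper())
    return kept


# "GTTT" is the reverse complement of "AAAC", so only one of them survives.
print(collapse_duplicates(["ACGT", "ACGT", "AAAC", "GTTT", "ggca"]))
```

Using a set of canonical keys keeps the pass linear in the number of sequences, which matters at the scale of millions of 1 kb windows.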
