Fugu-MT Paper Translation (Abstract): Rethinking the BERT-like Pretraining for DNA Sequences

Paper overview: Rethinking the BERT-like Pretraining for DNA Sequences

arxiv url: http://arxiv.org/abs/2310.07644v1
Date: Wed, 11 Oct 2023 16:40:57 GMT
Status: Translation complete
Last updated in system: 2023-10-12 21:52:41.848372
Title: Rethinking the BERT-like Pretraining for DNA Sequences
Title (reference translation): A Reexamination of BERT-like Pretraining for DNA Sequences
Authors: Chaoqi Liang, Weiqiang Bai, Lifeng Qiao, Yuchen Ren, Jianle Sun, Peng Ye, Hongliang Yan, Xinzhu Ma, Wangmeng Zuo, and Wanli Ouyang
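Abstract summary: Existing pretraining methods for DNA sequences rely on directly adopting BERT from NLP. The authors propose a novel method called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding the mask boundary.
Reference score (independently computed interest level): 72.85177907538872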
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the success of large-scale pretraining in NLP, there is an increasing trend of applying it to the domain of life sciences. In particular, pretraining methods based on DNA sequences have garnered growing attention due to their potential to capture generic information about genes. However, existing pretraining methods for DNA sequences largely rely on direct adoptions of BERT pretraining from NLP, lacking a comprehensive understanding and a specifically tailored approach. To address this research gap, we first conducted a series of exploratory experiments and gained several insightful observations: 1) In the fine-tuning phase of downstream tasks, when using K-mer overlapping tokenization instead of K-mer non-overlapping tokenization, both overlapping and non-overlapping pretraining weights show consistent performance improvement. 2) During the pre-training process, using K-mer overlapping tokenization quickly produces clear K-mer embeddings and reduces the loss to a very low level, while using K-mer non-overlapping tokenization results in less distinct embeddings and continuously decreases the loss. 3) Using overlapping tokenization causes the self-attention in the intermediate layers of pre-trained models to tend to overly focus on certain tokens, reflecting that these layers are not adequately optimized. In summary, overlapping tokenization can benefit the fine-tuning of downstream tasks but leads to inadequate pretraining with fast convergence. To unleash the pretraining potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving top-tier performance on 26 of 28 datasets spanning 7 downstream tasks.
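As a minimal illustration of the two tokenization schemes contrasted in the abstract, the Python sketch below splits a DNA sequence into K-mers with stride 1 (overlapping) or stride K (non-overlapping). The function name `kmer_tokens`, its parameters, and the choice to drop a trailing remainder shorter than K are assumptions made for illustration, not details taken from the paper.

```python
def kmer_tokens(seq: str, k: int = 6, overlapping: bool = True) -> list[str]:
    """Split a DNA sequence into K-mers (illustrative helper, not the paper's code).

    overlapping=True  uses stride 1, so neighbouring tokens share k-1 nucleotides.
    overlapping=False uses stride k; a trailing remainder shorter than k is dropped
    (an assumption of this sketch).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ATGCGTAC"
print(kmer_tokens(seq, k=3, overlapping=True))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
print(kmer_tokens(seq, k=3, overlapping=False))
# ['ATG', 'CGT']
```

With stride 1, each token shares K-1 nucleotides with its neighbour, so a single masked token can largely be read off the tokens around it. This is one plausible reading of why the abstract reports that the loss drops to a very low level so quickly under overlapping tokenization.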
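Abstract (reference translation): With the success of large-scale pretraining in NLP, there is a growing trend of applying it to the life sciences. In particular, pretraining methods based on DNA sequences have attracted attention for their potential to capture general information about genes. However, existing pretraining methods for DNA sequences rely mainly on directly adopting BERT from NLP and lack a comprehensive understanding and a specifically tailored approach. To address this research gap, we first conducted a series of exploratory experiments and made several insightful observations. 1) In the fine-tuning phase of downstream tasks, when using K-mer overlapping tokenization instead of K-mer non-overlapping tokenization, both overlapping and non-overlapping pretraining weights show consistent performance improvement. 2) During the pre-training process, using K-mer overlapping tokenization quickly produces clear K-mer embeddings and reduces the loss to a very low level, while using K-mer non-overlapping tokenization results in less distinct embeddings and continuously decreases the loss. 3) With overlapping tokenization, the self-attention in the intermediate layers of pretrained models tends to focus excessively on certain tokens, reflecting that these layers are not adequately optimized. In summary, overlapping tokenization benefits the fine-tuning of downstream tasks but leads to inadequate pretraining with fast convergence. To unleash the pretraining potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving top-tier performance on 26 of 28 datasets spanning 7 downstream tasks.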
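The abstract describes RandomMask only at a high level: the mask boundary is expanded over the course of pretraining so that the task becomes harder. The Python sketch below shows one way such a curriculum could look, masking contiguous spans of tokens whose maximum length grows with the training step. Everything concrete here is an assumption for illustration: the names `max_span_for_step` and `mask_spans`, the linear schedule, the `final_max_span` and `mask_ratio` defaults, and the `[MASK]` token string are hypothetical and are not taken from the paper.

```python
import random


def max_span_for_step(step: int, total_steps: int, final_max_span: int = 10) -> int:
    """Hypothetical schedule: the longest allowed masked span grows linearly from 1."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1 + int(frac * (final_max_span - 1))


def mask_spans(tokens, step, total_steps, mask_ratio=0.15, mask_token="[MASK]", rng=None):
    """Mask contiguous spans until about mask_ratio of the tokens are hidden.

    Returns (masked_tokens, is_masked). Span lengths are drawn uniformly from
    1..max_span_for_step(step, total_steps), so later steps allow wider spans.
    """
    rng = rng or random.Random()
    n = len(tokens)
    masked = list(tokens)
    is_masked = [False] * n
    budget = max(1, int(n * mask_ratio))  # number of tokens to hide
    max_span = max_span_for_step(step, total_steps)
    attempts = 0
    while sum(is_masked) < budget and attempts < 10 * n:
        attempts += 1
        span = rng.randint(1, max_span)
        start = rng.randrange(n)
        for i in range(start, min(start + span, n)):
            if sum(is_masked) >= budget:
                break
            is_masked[i] = True
            masked[i] = mask_token
    return masked, is_masked


tokens = ["ATG", "TGC", "GCG", "CGT", "GTA", "TAC"] * 5
early, _ = mask_spans(tokens, step=0, total_steps=1000, rng=random.Random(0))
late, _ = mask_spans(tokens, step=1000, total_steps=1000, rng=random.Random(0))
print(early)  # only isolated tokens are masked at step 0
print(late)   # spans of up to 10 consecutive tokens may be masked at the final step
```

Under overlapping tokenization, a span of consecutive masked tokens leaves fewer shared nucleotides visible in the neighbouring tokens than an isolated mask does, which is the intuition this sketch tries to capture.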

Related papers

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
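This study investigates how multi-head softmax attention models perform in-context learning on linear data. The results show that the in-context learning ability emerges from trained transformers as an aggregate effect of their architecture and the underlying data distribution.
Paper reference translation (metadata) (2025-03-17T02:00:49Z)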
Post-Hoc Uncertainty Quantification in Pre-Trained Neural Networks via Activation-Level Gaussian Processes [0.15705429611931052]
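This paper introduces Gaussian Process Activation functions (GAPA) to capture neuron-level uncertainty. The approach operates in a post-hoc manner while preserving the original mean predictions of the pretrained neural network.
Paper reference translation (metadata) (2025-02-28T11:29:06Z)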
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
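MxDNA is a novel framework in which the model gradually and autonomously learns an effective DNA tokenization strategy. The authors show that MxDNA learns a unique tokenization strategy distinct from those of conventional methods and captures genomic functionalities at the token level during self-supervised pretraining.
Paper reference translation (metadata) (2024-12-18T10:55:43Z)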
A Novel Hybrid Parameter-Efficient Fine-Tuning Approach for Hippocampus Segmentation and Alzheimer's Disease Diagnosis [12.775565417928895]
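This paper proposes a parameter-efficient fine-tuning method called HyPS, which uses a hybrid parallel and serial architecture. HyPS updates a minimal subset of model parameters and retains the original knowledge structure of the pretrained model. In distinguishing Alzheimer's disease from cognitively normal (CN) individuals, HyPS achieved classification accuracies of 83.78% and 64.29%, respectively.
Paper reference translation (metadata) (2024-09-02T00:52:00Z)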
Self-Distillation Improves DNA Sequence Inference [15.497250990633047]
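Self-supervised pretraining (SSP) is recognized as a way to improve prediction accuracy on a variety of downstream tasks. This limitation stems mainly from the fact that existing SSP approaches in genomics focus on masked language modeling of individual sequences. This paper proposes an innovative deep neural network model that incorporates collaborative learning between student and teacher subnetworks.
Paper reference translation (metadata) (2024-05-14T12:24:52Z)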
Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
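The authors show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters. They employ a simple unit-ball normalization that enables learning at large update ratios.
Paper reference translation (metadata) (2024-03-09T19:56:40Z)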
Hierarchical Pretraining on Multimodal Electronic Health Records [53.63585531565068]
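This paper introduces MEDHMP, a novel, general, and unified pretraining framework designed specifically for hierarchical multimodal EHR data. The effectiveness of the proposed MEDHMP is demonstrated through experimental results on eight downstream tasks spanning three levels.
Paper reference translation (metadata) (2023-10-11T20:23:33Z)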
Multi-Level Contrastive Learning for Dense Prediction Task [59.591755258395594]
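This paper proposes Multi-Level Contrastive Learning for Dense Prediction Task (MCL), which efficiently learns region-level feature representations for dense prediction tasks. The method is motivated by three factors: localization, scale consistency, and recognition. The proposed method outperforms recent state-of-the-art methods by a significant margin on various datasets.
Paper reference translation (metadata) (2023-04-04T17:59:04Z)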
TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
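This paper focuses on fine-tuning adversarially pretrained models on a variety of classification tasks. It proposes the TWINS (Two-WIng NormliSation) fine-tuning framework. TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
Paper reference translation (metadata) (2023-03-20T14:12:55Z)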
Does GNN Pretraining Help Molecular Representation? [5.5459878275267736]
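Self-supervised graph pretraining does not have a statistically significant advantage over non-pretraining methods in many settings. Improvements can be observed with additional supervised pretraining, but they may diminish with richer features or more balanced data splits. The authors hypothesize that the complexity of pretraining on molecules is insufficient, leading to less transferable knowledge for downstream tasks.
Paper reference translation (metadata) (2022-07-13T07:34:16Z)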
SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study [48.75445626157713]
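SNP2Vec is a scalable self-supervised pretraining method for understanding SNPs. This study uses SNP2Vec to perform long-sequence genomics modeling. The authors evaluate the effectiveness of the approach on Alzheimer's disease risk prediction in a Chinese cohort.
Paper reference translation (metadata) (2022-04-14T01:53:58Z)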
Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
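A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by the covariation between residues in a sequence. The authors propose a novel method that captures this information directly by pretraining a dedicated language model called the Pairwise Masked Language Model (PMLM). They show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared with the baseline.
Paper reference translation (metadata) (2021-10-29T04:01:32Z)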

The related papers list is generated automatically from the titles and abstracts of papers on this site.

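This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from its use.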