Toward Understanding BERT-Like Pre-Training for DNA Foundation Models
- URL: http://arxiv.org/abs/2310.07644v3
- Date: Sun, 8 Sep 2024 09:50:13 GMT
- Title: Toward Understanding BERT-Like Pre-Training for DNA Foundation Models
- Authors: Chaoqi Liang, Lifeng Qiao, Peng Ye, Nanqing Dong, Jianle Sun, Weiqiang Bai, Yuchen Ren, Xinzhu Ma, Hongliang Yan, Chunfeng Song, Wanli Ouyang, Wangmeng Zuo
- Abstract summary: Existing pre-training methods for DNA sequences rely on direct adoption of BERT pre-training from NLP.
We introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary.
RandomMask achieves a staggering 68.16% in Matthews correlation coefficient for Epigenetic Mark Prediction, a groundbreaking increase of 19.85% over the baseline.
- Score: 78.48760388079523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of large-scale pre-training in language tasks, there is an increasing trend of applying it to the domain of life sciences. In particular, pre-training methods based on DNA sequences have received increasing attention because of their potential to capture general information about genes. However, existing pre-training methods for DNA sequences largely rely on direct adoption of BERT pre-training from NLP, lacking a comprehensive understanding and a specifically tailored approach. To address this research gap, we provide the first empirical study with three insightful observations. Based on the empirical study, we notice that an overlapping tokenizer can benefit the fine-tuning of downstream tasks but leads to inadequate pre-training because of fast convergence. To unleash the pre-training potential, we introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pre-training by continuously expanding its mask boundary, forcing the model to learn more knowledge. RandomMask is simple but effective, achieving state-of-the-art performance across 6 downstream tasks. RandomMask achieves a staggering 68.16% in Matthews correlation coefficient for Epigenetic Mark Prediction, a groundbreaking increase of 19.85% over the baseline and a remarkable 3.69% improvement over the previous state-of-the-art result.
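The abstract's key mechanism lends itself to a short illustration: DNA is commonly tokenized into overlapping k-mers (adjacent tokens share k-1 bases), and RandomMask gradually widens the span of contiguous tokens that get masked (the "mask boundary") as pre-training proceeds. The Python sketch below is only an illustration of that idea under assumed settings (k=6, a 15% mask ratio, a linear boundary schedule, and hypothetical function names); it is not the authors' implementation.

```python
import random

def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokenization: adjacent tokens share k-1 bases."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def expanding_mask_ids(num_tokens, mask_ratio=0.15, max_span=1):
    """Select positions to mask as contiguous spans of width 1..max_span.
    Here `max_span` plays the role of the mask boundary that is expanded."""
    target = max(1, int(num_tokens * mask_ratio))
    masked = set()
    while len(masked) < target:
        span = random.randint(1, max_span)
        start = random.randrange(max(1, num_tokens - span + 1))
        masked.update(range(start, start + span))
    return sorted(masked)[:target]

def mask_boundary(step, total_steps, start=1, end=8):
    """Assumed curriculum: widen the boundary linearly over pre-training."""
    frac = step / max(1, total_steps)
    return start + round(frac * (end - start))

tokens = kmer_tokenize("ACGTACGTGGCTAACGTTAGC", k=6)
for step in (0, 5_000, 10_000):
    b = mask_boundary(step, total_steps=10_000)
    print(f"step={step:>6} boundary={b} "
          f"masked={expanding_mask_ids(len(tokens), max_span=b)}")
```

Widening the masked span over time is consistent with the abstract's observation that overlapping tokenization makes pre-training converge too quickly: a single masked token can be reconstructed almost trivially from neighbouring tokens that share most of its bases, so larger masked spans keep the task challenging.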
Related papers
- A Novel Hybrid Parameter-Efficient Fine-Tuning Approach for Hippocampus Segmentation and Alzheimer's Disease Diagnosis [12.775565417928895]
We propose a novel parameter-efficient fine-tuning strategy, termed HyPS, which employs a hybrid parallel and serial architecture.
HyPS updates a minimal subset of model parameters, thereby retaining the pre-trained model's original knowledge structure.
In distinguishing Alzheimer's disease from cognitively normal (CN) individuals, HyPS achieved classification accuracies of 83.78% and 64.29%.
arXiv Detail & Related papers (2024-09-02T00:52:00Z) - Self-Distillation Improves DNA Sequence Inference [15.497250990633047]
Self-supervised pretraining (SSP) has been recognized as a method to enhance prediction accuracy in various downstream tasks, yet its gains in DNA sequence inference have remained limited.
This limitation stems primarily from the fact that most existing SSP approaches in genomics focus on masked language modeling of individual sequences.
We introduce an innovative deep neural network model, which incorporates collaborative learning between a "student" and a "teacher" subnetwork.
arXiv Detail & Related papers (2024-05-14T12:24:52Z) - Dissecting Deep RL with High Update Ratios: Combatting Value Divergence [21.282292112642747]
We show that deep reinforcement learning algorithms can retain their ability to learn without resetting network parameters.
We employ a simple unit-ball normalization that enables learning under large update ratios.
arXiv Detail & Related papers (2024-03-09T19:56:40Z) - Hierarchical Pretraining on Multimodal Electronic Health Records [53.63585531565068]
This paper introduces a novel, general, and unified pretraining framework called MEDHMP, specifically designed for hierarchically multimodal EHR data.
The effectiveness of the proposed MEDHMP is demonstrated through experimental results on eight downstream tasks spanning three levels.
arXiv Detail & Related papers (2023-10-11T20:23:33Z) - Multi-Level Contrastive Learning for Dense Prediction Task [59.591755258395594]
We present Multi-Level Contrastive Learning for Dense Prediction Task (MCL), an efficient self-supervised method for learning region-level feature representation for dense prediction tasks.
Our method is motivated by the three key factors in detection: localization, scale consistency and recognition.
Our method consistently outperforms the recent state-of-the-art methods on various datasets with significant margins.
arXiv Detail & Related papers (2023-04-04T17:59:04Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of
Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - Does GNN Pretraining Help Molecular Representation? [5.5459878275267736]
Self-supervised graph pretraining does not have statistically significant advantages over non-pretraining methods in many settings.
Although improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits.
We hypothesize the complexity of pretraining on molecules is insufficient, leading to less transferable knowledge for downstream tasks.
arXiv Detail & Related papers (2022-07-13T07:34:16Z) - SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide
Association Study [48.75445626157713]
SNP2Vec is a scalable self-supervised pre-training approach for understanding SNPs.
We apply SNP2Vec to perform long-sequence genomics modeling.
We evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort.
arXiv Detail & Related papers (2022-04-14T01:53:58Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method can effectively capture the inter-residue correlations and improves the performance of contact prediction by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.