Rethinking the BERT-like Pretraining for DNA Sequences
- URL: http://arxiv.org/abs/2310.07644v2
- Date: Thu, 12 Oct 2023 03:32:32 GMT
- Title: Rethinking the BERT-like Pretraining for DNA Sequences
- Authors: Chaoqi Liang, Weiqiang Bai, Lifeng Qiao, Yuchen Ren, Jianle Sun, Peng
Ye, Hongliang Yan, Xinzhu Ma, Wangmeng Zuo, and Wanli Ouyang
- Abstract summary: Existing pretraining methods for DNA sequences rely on direct adoption of BERT pretraining from NLP.
We introduce a novel approach called RandomMask, which gradually increases the task difficulty of BERT-like pretraining by continuously expanding its mask boundary.
- Score: 72.85177907538872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the success of large-scale pretraining in NLP, there is an increasing
trend of applying it to the domain of life sciences. In particular, pretraining
methods based on DNA sequences have garnered growing attention due to their
potential to capture generic information about genes. However, existing
pretraining methods for DNA sequences largely rely on direct adoption of BERT
pretraining from NLP, lacking a comprehensive understanding and a specifically
tailored approach. To address this research gap, we first conducted a series of
exploratory experiments and gained several insightful observations: 1) In the
fine-tuning phase of downstream tasks, when using K-mer overlapping
tokenization instead of K-mer non-overlapping tokenization, both overlapping
and non-overlapping pretraining weights show consistent performance
improvement. 2) During the pre-training process, using K-mer overlapping
tokenization quickly produces clear K-mer embeddings and reduces the loss to a
very low level, while using K-mer non-overlapping tokenization results in less
distinct embeddings and continuously decreases the loss. 3) Using overlapping
tokenization causes the self-attention in the intermediate layers of
pre-trained models to tend to overly focus on certain tokens, reflecting that
these layers are not adequately optimized. In summary, overlapping tokenization
can benefit the fine-tuning of downstream tasks but leads to inadequate
pretraining with fast convergence. To unleash the pretraining potential, we
introduce a novel approach called RandomMask, which gradually increases the
task difficulty of BERT-like pretraining by continuously expanding its mask
boundary, forcing the model to learn more knowledge. RandomMask is simple but
effective, achieving top-tier performance on 26 of 28 datasets spanning 7
downstream tasks.
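
For concreteness, the snippet below is a minimal sketch, not the paper's released code, of the two tokenization schemes compared above: overlapping K-mers slide one base at a time, so adjacent tokens share k-1 characters, while non-overlapping K-mers advance k bases at a time.

```python
# Minimal sketch (not the paper's code): overlapping vs. non-overlapping
# K-mer tokenization of a DNA sequence.

def kmer_tokenize(seq: str, k: int = 6, overlapping: bool = True) -> list[str]:
    """Split a DNA sequence into K-mer tokens.

    overlapping=True  -> stride 1, adjacent tokens share k-1 bases
    overlapping=False -> stride k, tokens are disjoint
    """
    stride = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

seq = "ACGTACGTACGT"
print(kmer_tokenize(seq, k=6, overlapping=True))
# ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC', 'CGTACG', 'GTACGT']
print(kmer_tokenize(seq, k=6, overlapping=False))
# ['ACGTAC', 'GTACGT']
```

The shared bases are what let the model trivially infer a masked token from its neighbors, which is consistent with the abstract's observation that overlapping tokenization converges quickly but leaves the intermediate layers under-optimized. The abstract describes RandomMask only at a high level, so the second sketch is a hedged interpretation of an expanding mask boundary: contiguous spans of tokens are masked, and the maximum allowed span width grows as pretraining progresses. The linear schedule, the width cap, and the 15% mask ratio are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of an expanding mask boundary (an interpretation, not the
# released RandomMask implementation): spans of contiguous tokens are masked,
# and the maximum span width grows with the training step.
import random

def expanding_span_mask(num_tokens: int, step: int, total_steps: int,
                        mask_ratio: float = 0.15, max_boundary: int = 8) -> set[int]:
    """Return token positions to mask for one sequence at a given step.

    The allowed span width is interpolated from 1 (standard single-token
    BERT masking) up to `max_boundary` by the end of training. The linear
    schedule and the cap of 8 are assumptions made for illustration.
    """
    boundary = 1 + round((max_boundary - 1) * step / max(total_steps, 1))
    target = int(num_tokens * mask_ratio)
    masked: set[int] = set()
    while len(masked) < target:
        width = random.randint(1, boundary)             # span width within the current boundary
        start = random.randint(0, max(num_tokens - width, 0))
        masked.update(range(start, start + width))      # mask the whole span
    return masked

# Early in training the spans are narrow; late in training they can be wide,
# so reconstructing the masked bases requires longer-range context.
print(sorted(expanding_span_mask(num_tokens=100, step=0, total_steps=100_000)))
print(sorted(expanding_span_mask(num_tokens=100, step=100_000, total_steps=100_000)))
```
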
Related papers
- Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization [13.475050661770796]
We develop a simple yet effective strategy to facilitate per-tensor activation quantization by preventing the generation of problematic tokens.
We tune the token cache to regularize the activations of subsequent tokens to be more quantization-friendly.
We thoroughly evaluate our method over a wide range of models and benchmarks and find that it significantly surpasses the established baseline of per-tensor W8A8 quantization.
arXiv Detail & Related papers (2024-06-17T18:33:44Z) - Incremental Self-training for Semi-supervised Learning [56.57057576885672]
IST is simple yet effective and fits existing self-training-based semi-supervised learning methods.
We verify the proposed IST on five datasets and two types of backbones, effectively improving recognition accuracy and learning speed.
arXiv Detail & Related papers (2024-04-14T05:02:00Z) - Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z) - TWINS: A Fine-Tuning Framework for Improved Transferability of
Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z) - Does GNN Pretraining Help Molecular Representation? [5.5459878275267736]
Self-supervised graph pretraining does not have statistically significant advantages over non-pretraining methods in many settings.
Although improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits.
We hypothesize the complexity of pretraining on molecules is insufficient, leading to less transferable knowledge for downstream tasks.
arXiv Detail & Related papers (2022-07-13T07:34:16Z) - Generalization, Mayhems and Limits in Recurrent Proximal Policy
Optimization [1.8570591025615453]
We highlight vital details that one must get right when adding recurrence to achieve a correct and efficient implementation.
We explore the limitations of recurrent PPO by benchmarking the contributed novel environments Mortar Mayhem and Searing Spotlights.
Remarkably, we can demonstrate a transition to strong generalization in Mortar Mayhem when scaling the number of training seeds.
arXiv Detail & Related papers (2022-05-23T07:54:15Z) - Revisiting Pretraining for Semi-Supervised Learning in the Low-Label
Regime [15.863530936691157]
Semi-supervised learning (SSL) addresses the lack of labeled data by exploiting large unlabeled data through pseudolabeling.
Recent studies combined finetuning (FT) from pretrained weights with SSL to mitigate the challenges and claimed superior results in the low-label regime.
arXiv Detail & Related papers (2022-05-06T03:53:25Z) - Pre-training Co-evolutionary Protein Representation via A Pairwise
Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences.
We propose a novel method to capture this information directly by pre-training via a dedicated language model, i.e., the Pairwise Masked Language Model (PMLM).
Our results show that the proposed method effectively captures the inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
arXiv Detail & Related papers (2021-10-29T04:01:32Z) - Bi-Granularity Contrastive Learning for Post-Training in Few-Shot Scene [10.822477939237459]
We propose contrastive masked language modeling (CMLM) for post-training to integrate both token-level and sequence-level contrastive learning.
CMLM surpasses several recent post-training methods in few-shot settings without the need for data augmentation.
arXiv Detail & Related papers (2021-06-04T08:17:48Z) - Two-phase Pseudo Label Densification for Self-training based Domain
Adaptation [93.03265290594278]
We propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD.
In the first phase, we use sliding-window voting to propagate confident predictions, exploiting intrinsic spatial correlations in the images.
In the second phase, we perform a confidence-based easy-hard classification.
To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss.
arXiv Detail & Related papers (2020-12-09T02:35:25Z)