Training Compute-Optimal Protein Language Models
- URL: http://arxiv.org/abs/2411.02142v1
- Date: Mon, 04 Nov 2024 14:58:37 GMT
- Title: Training Compute-Optimal Protein Language Models
- Authors: Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song,
- Abstract summary: Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
- Score: 48.79416103951816
- License:
- Abstract: We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model~(MLM) when repeating the commonly used Uniref database. To address this, we included metagenomic protein sequences in the training set to increase the diversity and avoid the plateau or overfitting effects. Second, we obtained the scaling laws of CLM and MLM on Transformer, tailored to the specific characteristics of protein sequence data. Third, we observe a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare the large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within less or equivalent pre-training compute budgets.
Related papers
- Energy Efficient Protein Language Models: Leveraging Small Language Models with LoRA for Controllable Protein Generation [1.041213135652454]
We introduce two small protein language models, capable of both uncontrollable and controllable protein generation.
For the uncontrollable generation task, our best model achieves an average pLDDT score of 69.75, demonstrating robust performance in generating viable protein structures.
We also demonstrate the deployment of our models on the energy efficient ET-SoC-1 chip, significantly improving the TPS/W by a factor of 3.
arXiv Detail & Related papers (2024-11-08T20:52:06Z) - Design Proteins Using Large Language Models: Enhancements and Comparative Analyses [12.140433802768733]
We adopt a suite of pre-trained LLMs, including Mistral-7B1, Llama-2-7B2, Llama-3-8B3, and gemma-7B4, to produce valid protein sequences.
We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures.
Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models.
arXiv Detail & Related papers (2024-08-12T08:17:27Z) - Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training.
We propose three effective strategies to enhance LLM performance within a fixed compute budget.
Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z) - Are Protein Language Models Compute Optimal? [0.0]
We investigate the optimal ratio between model parameters and training tokens within a fixed compute budget.
Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases.
This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
arXiv Detail & Related papers (2024-06-11T13:32:11Z) - xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering
the Language of Protein [76.18058946124111]
We propose a unified protein language model, xTrimoPGLM, to address protein understanding and generation tasks simultaneously.
xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories.
It can also generate de novo protein sequences following the principles of natural ones, and can perform programmable generation after supervised fine-tuning.
arXiv Detail & Related papers (2024-01-11T15:03:17Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - Ankh: Optimized Protein Language Model Unlocks General-Purpose Modelling [0.0]
We present Ankh, the first general-purpose protein language model trained on Google's TPU-v4.
Ankh succeeds in learning protein evolutionary conservation-mutation trends and introducing functional diversity while retaining key structural-functional characteristics.
arXiv Detail & Related papers (2023-01-16T19:04:45Z) - Reprogramming Pretrained Language Models for Protein Sequence
Representation Learning [68.75392232599654]
We propose Representation Learning via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $105$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z) - To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on
Resource Rich Tasks [25.05882459314221]
We show that as the number of training examples grow into the millions, the accuracy gap between finetuning BERT-based model and training vanilla LSTM from scratch narrows to within 1%.
Our findings indicate that pre-trained models might reach a diminishing return point as the supervised data size increases significantly.
arXiv Detail & Related papers (2020-06-15T18:18:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.