PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications
- URL: http://arxiv.org/abs/2310.17415v1
- Date: Thu, 26 Oct 2023 14:20:44 GMT
- Title: PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications
- Authors: Yang Tan, Mingchen Li, Pan Tan, Ziyi Zhou, Huiqun Yu, Guisheng Fan, Liang Hong
- Abstract summary: PETA trained language models with 14 different vocabulary sizes under three tokenization methods.
It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities.
Experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance.
- Score: 9.782175445247127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large protein language models are adept at capturing the underlying
evolutionary information in primary structures, offering significant practical
value for protein engineering. Compared to natural language models, protein
amino acid sequences have a smaller data volume and a limited combinatorial
space. Choosing an appropriate vocabulary size to optimize the pre-trained
model is a pivotal issue. Moreover, despite the wealth of benchmarks and
studies in the natural language community, there remains a lack of a
comprehensive benchmark for systematically evaluating protein language model
quality. Given these challenges, PETA trained language models with 14 different
vocabulary sizes under three tokenization methods. It conducted thousands of
tests on 33 diverse downstream datasets to assess the models' transfer learning
capabilities, incorporating two classification heads and three random seeds to
mitigate potential biases. Extensive experiments indicate that vocabulary sizes
between 50 and 200 optimize the model, whereas sizes exceeding 800
detrimentally affect the model's representational performance. Our code, model
weights and datasets are available at
https://github.com/ginnm/ProteinPretraining.
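As a concrete, hypothetical illustration of the sub-word tokenization the paper studies (not the authors' released training pipeline), the sketch below trains a BPE vocabulary in the 50-200 range the abstract identifies as optimal, using the HuggingFace `tokenizers` library; the toy sequences, the vocabulary size of 128, and the special tokens are placeholder choices.

```python
# Minimal sketch: learn a BPE sub-word vocabulary over amino acid sequences.
# Assumes the HuggingFace `tokenizers` package; a real setup would train on a
# large corpus (e.g. UniRef) rather than this toy list.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

sequences = [  # primary structures in one-letter amino acid codes (placeholders)
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR",
    "MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSEF",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=128,                     # within the 50-200 range reported as optimal
    special_tokens=["[UNK]", "[PAD]"],  # illustrative special tokens
)
tokenizer.train_from_iterator(sequences, trainer=trainer)

encoding = tokenizer.encode(sequences[0])
print(encoding.tokens)  # learned sub-word segments of the first sequence
```

The same recipe can be repeated with other vocabulary sizes or sub-word schemes to probe the effect the paper measures on downstream transfer.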
Related papers
- Emergent Abilities in Reduced-Scale Generative Language Models [10.51168925267033]
Large language models can solve new tasks without task-specific fine-tuning.
This capability is considered an emergent ability and is primarily seen in large language models with billions of parameters.
This study investigates if such emergent properties are strictly tied to model size or can be demonstrated by smaller models trained on reduced-scale data.
arXiv Detail & Related papers (2024-04-02T18:00:28Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Reprogramming Pretrained Language Models for Protein Sequence Representation Learning [68.75392232599654]
We propose Representation Reprogramming via Dictionary Learning (R2DL), an end-to-end representation learning framework.
R2DL reprograms a pretrained English language model to learn the embeddings of protein sequences.
Our model can attain better accuracy and significantly improve the data efficiency by up to $10^5$ times over the baselines set by pretrained and standard supervised methods.
arXiv Detail & Related papers (2023-01-05T15:55:18Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- Scaling Language Models: Methods, Analysis & Insights from Training Gopher [83.98181046650664]
We present an analysis of Transformer-based language model performance across a wide range of model scales.
Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language.
We discuss the application of language models to AI safety and the mitigation of downstream harms.
arXiv Detail & Related papers (2021-12-08T19:41:47Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models (a rough sketch of the idea appears after this list).
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
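As a loose, hypothetical sketch of the idea described in the Grounded Compositional Outputs entry above (not that paper's actual architecture), the module below composes each word's output embedding from its characters, so the output layer's parameter count does not grow with the training vocabulary; all class, variable, and dimension choices here are illustrative.

```python
# Loose sketch: output embeddings composed from characters, so the output layer's
# parameters are independent of vocabulary size. Names and sizes are illustrative.
import torch
import torch.nn as nn

class CompositionalOutputEmbedding(nn.Module):
    def __init__(self, n_chars: int, char_dim: int = 32, hidden_dim: int = 256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.composer = nn.GRU(char_dim, hidden_dim, batch_first=True)

    def forward(self, word_char_ids: torch.Tensor) -> torch.Tensor:
        # word_char_ids: (vocab_size, max_word_len) character ids, 0 = padding
        chars = self.char_emb(word_char_ids)   # (vocab, max_len, char_dim)
        _, last_state = self.composer(chars)   # (1, vocab, hidden_dim)
        return last_state.squeeze(0)           # (vocab, hidden_dim)

# Usage: score an LM hidden state against whatever vocabulary is currently in use.
vocab_char_ids = torch.randint(1, 28, (1000, 12))  # 1000 words, up to 12 characters
out_layer = CompositionalOutputEmbedding(n_chars=28)
word_embs = out_layer(vocab_char_ids)              # (1000, 256)
hidden = torch.randn(4, 256)                       # hidden states for a batch of 4
logits = hidden @ word_embs.t()                    # (4, 1000) next-word scores
```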