Pretrained Language Model Embryology: The Birth of ALBERT
- URL: http://arxiv.org/abs/2010.02480v2
- Date: Thu, 29 Oct 2020 00:07:43 GMT
- Title: Pretrained Language Model Embryology: The Birth of ALBERT
- Authors: Cheng-Han Chiang, Sung-Feng Huang and Hung-yi Lee
- Abstract summary: We investigate the developmental process from a set of randomly initialized parameters to a totipotent language model.
Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different speeds during pretraining.
These findings suggest that the knowledge of a pretrained model varies during pretraining, and that more pretraining steps do not necessarily provide a model with more comprehensive knowledge.
- Score: 68.5801642674541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the behaviors of pretrained language models (LMs) have been thoroughly examined, what happens during pretraining is rarely studied. We thus investigate the developmental process from a set of randomly initialized parameters to a totipotent language model, which we refer to as the embryology of a pretrained language model. Our results show that ALBERT learns to reconstruct and predict tokens of different parts of speech (POS) at different speeds during pretraining. We also find that linguistic knowledge and world knowledge do not generally improve as pretraining proceeds, nor does downstream task performance. These findings suggest that the knowledge of a pretrained model varies during pretraining, and that more pretraining steps do not necessarily provide a model with more comprehensive knowledge. We provide source code and pretrained models to reproduce our results at https://github.com/d223302/albert-embryology.
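As a rough illustration of the kind of checkpoint probing the abstract describes, the sketch below masks each word of a few example sentences, asks an ALBERT model for its top-1 prediction of the masked token, and buckets accuracy by POS tag. This is a minimal sketch, not the authors' released code: the checkpoint name ("albert-base-v2" stands in for an intermediate pretraining checkpoint), the example sentences, and the use of NLTK's off-the-shelf POS tagger are all assumptions, and each masked word is approximated by a single [MASK] token.

```python
# Minimal sketch: masked-token prediction accuracy of an ALBERT checkpoint,
# grouped by part of speech. Assumes Hugging Face `transformers`, `torch`,
# and `nltk`; "albert-base-v2" stands in for an intermediate checkpoint.
from collections import defaultdict

import torch
import nltk
from transformers import AlbertTokenizerFast, AlbertForMaskedLM

nltk.download("averaged_perceptron_tagger", quiet=True)      # NLTK <= 3.8
nltk.download("averaged_perceptron_tagger_eng", quiet=True)  # NLTK >= 3.9

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")
model.eval()

# Toy examples; a real probe would use a held-out corpus.
sentences = [
    "The cat sat quietly on the warm mat .",
    "She quickly finished the long report yesterday .",
]

hits = defaultdict(int)    # correct top-1 predictions per POS tag
totals = defaultdict(int)  # masked words per POS tag

for sent in sentences:
    words = sent.split()
    for i, (word, tag) in enumerate(nltk.pos_tag(words)):
        # Replace one word with a single [MASK] (multi-piece words are only
        # approximated by this) and ask the model to fill it in.
        masked = words.copy()
        masked[i] = tokenizer.mask_token
        inputs = tokenizer(" ".join(masked), return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
        pred_id = logits[0, mask_pos].argmax().item()
        pred = tokenizer.decode([pred_id]).strip().lower()
        totals[tag] += 1
        hits[tag] += int(pred == word.lower())

for tag in sorted(totals):
    print(f"{tag:5s} acc = {hits[tag] / totals[tag]:.2f} ({totals[tag]} masked words)")
```

Running the same loop over a series of saved pretraining checkpoints, rather than a single released model, would yield the kind of per-POS learning curves the paper reports.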
Related papers
- Can training neural language models on a curriculum with developmentally plausible data improve alignment with human reading behavior? [0.2745342790938508]
This paper explores the extent to which the misalignment between empirical and model-predicted behavior can be minimized by training models on more developmentally plausible data.
We trained teacher language models on the BabyLM "strict-small" dataset and used sentence-level surprisal estimates from these teacher models to create a curriculum (a minimal sketch of this surprisal-based ordering appears after the related-papers list).
We found tentative evidence that our curriculum made it easier for models to acquire linguistic knowledge from the training data.
arXiv Detail & Related papers (2023-11-30T18:03:58Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- What do Large Language Models Learn beyond Language? [10.9650651784511]
We find that pretrained models significantly outperform comparable non-pretrained neural models.
Experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code.
Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models.
arXiv Detail & Related papers (2022-10-21T23:43:13Z)
- Continual Pre-Training Mitigates Forgetting in Language and Vision [43.80547864450793]
We show that continually pre-trained models are robust against catastrophic forgetting.
We provide empirical evidence supporting the fact that self-supervised pre-training is more effective in retaining previous knowledge than supervised protocols.
arXiv Detail & Related papers (2022-05-19T07:27:12Z)
- Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge [91.15301779076187]
We introduce verbalized knowledge into the minibatches of a BERT model during pre-training and evaluate how well the model generalizes to supported inferences.
We find generalization does not improve over the course of pre-training, suggesting that commonsense knowledge is acquired from surface-level, co-occurrence patterns rather than induced, systematic reasoning.
arXiv Detail & Related papers (2021-12-16T03:13:04Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing existing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- A Survey of Knowledge Enhanced Pre-trained Models [28.160826399552462]
We refer to pre-trained language models with knowledge injection as knowledge-enhanced pre-trained language models (KEPLMs).
These models demonstrate deep understanding and logical reasoning and introduce interpretability.
arXiv Detail & Related papers (2021-10-01T08:51:58Z)
- HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish [4.473327661758546]
This paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language.
We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models.
Based on the proposed procedure, a Polish BERT-based language model -- HerBERT -- is trained.
arXiv Detail & Related papers (2021-05-04T20:16:17Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human-language data yields GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
- Exploring Fine-tuning Techniques for Pre-trained Cross-lingual Models via Continual Learning [74.25168207651376]
Fine-tuning pre-trained language models to downstream cross-lingual tasks has shown promising results.
We leverage continual learning to preserve the cross-lingual ability of the pre-trained model when we fine-tune it to downstream tasks.
Our methods achieve better performance than other fine-tuning baselines on the zero-shot cross-lingual part-of-speech tagging and named entity recognition tasks.
arXiv Detail & Related papers (2020-04-29T14:07:18Z)
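The curriculum paper listed above orders training sentences by teacher-model surprisal. The sketch below is a hedged illustration of that idea, not that paper's released code: GPT-2 stands in for a teacher model actually trained on BabyLM data, and the corpus is a toy list.

```python
# Minimal sketch: order training sentences by surprisal under a teacher LM.
# Assumes Hugging Face `transformers` and `torch`; "gpt2" is a stand-in teacher.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
teacher = GPT2LMHeadModel.from_pretrained("gpt2")
teacher.eval()

corpus = [
    "The dog barked.",
    "Quantum chromodynamics describes the strong interaction.",
    "She went home.",
]

def sentence_surprisal(sentence: str) -> float:
    """Total negative log-probability (in nats) the teacher assigns to the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token-level
        # cross-entropy; multiply by the number of predicted tokens for a sum.
        out = teacher(**inputs, labels=inputs.input_ids)
    num_predicted = inputs.input_ids.size(1) - 1
    return out.loss.item() * num_predicted

# Curriculum: present low-surprisal (easier) sentences first.
for sent in sorted(corpus, key=sentence_surprisal):
    print(f"{sentence_surprisal(sent):7.2f}  {sent}")
```

The actual curriculum design in the paper (how surprisal scores are bucketed and paced over training) may differ; this only shows the surprisal computation and a simple ascending ordering.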