On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets
- URL: http://arxiv.org/abs/2109.03537v1
- Date: Wed, 8 Sep 2021 10:39:57 GMT
- Title: On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets
- Authors: Cheng-Han Chiang and Hung-yi Lee
- Abstract summary: Pre-training language models (LMs) on large-scale unlabeled text data enables them to achieve far better downstream performance than models trained from scratch.
We study which specific traits of the pre-training data, other than its semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
- Score: 74.11825654535895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-training language models (LMs) on large-scale unlabeled text data
enables them to achieve far better downstream performance than counterparts
trained directly on the downstream tasks. In this work, we study which specific
traits of the pre-training data, other than its semantics, make a pre-trained LM
superior to its counterparts trained from scratch on downstream tasks. We
propose using artificially constructed datasets as the pre-training data to
exclude the effect of semantics and to control which characteristics the
pre-training corpora have. By fine-tuning the pre-trained models on the GLUE
benchmark, we can measure how beneficial it is to transfer knowledge from a
model trained on a dataset possessing a specific trait. We define and study
three characteristics of the artificial datasets: 1) matching the unigram or
bigram token distribution between pre-training and downstream fine-tuning,
2) the presence of explicit dependencies among the tokens in a sequence, and
3) the length of implicit dependencies among the tokens in a sequence. Our
experiments show that explicit dependencies in the pre-training sequences are
critical to downstream performance. Our results also reveal that models achieve
better downstream performance when pre-trained on datasets with longer-range
implicit dependencies. Our analysis further shows that models pre-trained on
artificial datasets are prone to learning spurious correlations in downstream
tasks. Our work reveals that even when LMs are not pre-trained on natural
language, they still gain transferability to certain human-language downstream
tasks once they learn to model the token dependencies in the sequences. This
result helps explain the exceptional transferability of pre-trained LMs.
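The abstract does not spell out how the artificial corpora are constructed, so the sketch below only illustrates one plausible way to realize the "explicit dependencies" trait: every sampled token is forced to reappear later in the same sequence, so each position carries a dependency the model can learn. The paired-token scheme, the vocabulary size, and the sequence length are assumptions made for this example, not the authors' exact construction.

```python
# Illustrative sketch (an assumption, not the authors' exact recipe): build
# artificial pre-training sequences in which every token participates in an
# explicit dependency, realized here as a forced later reappearance.
import random

def make_sequence(vocab_size=30000, seq_len=128, rng=random):
    """Generate one sequence of token ids with nested paired dependencies."""
    seq, pending = [], []              # pending: tokens awaiting their later copy
    while len(seq) < seq_len:
        # Only open a new pair if there is still room to close it and all
        # previously opened pairs within seq_len tokens.
        can_open = len(seq) + len(pending) + 2 <= seq_len
        if pending and (not can_open or rng.random() < 0.5):
            seq.append(pending.pop())  # emit the dependent copy (LIFO, i.e. nested)
        elif can_open:
            tok = rng.randrange(vocab_size)
            seq.append(tok)
            pending.append(tok)        # schedule this token to reappear later
        else:                          # odd leftover slot: pad with a lone token
            seq.append(rng.randrange(vocab_size))
    return seq

# A corpus built this way could be fed to a standard masked-LM pre-training
# loop and the resulting checkpoint fine-tuned on GLUE, mirroring the transfer
# evaluation described in the abstract.
corpus = [make_sequence() for _ in range(1000)]
```

Varying such construction choices, for instance whether tokens must reappear at all or how far apart the dependent positions may be, is one way to produce the kind of controlled corpora the abstract describes for probing explicit dependencies and the range of implicit dependencies.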
Related papers
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [31.568675300434816]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset.
During inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one.
This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z) - A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z) - CTP: Towards Vision-Language Continual Pretraining via Compatible
Momentum Contrast and Topology Preservation [128.00940554196976]
Vision-Language Continual Pretraining (VLCP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets.
To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D.
The data from each industry is treated as an independent task to support continual learning, and it follows a real-world long-tail distribution to simulate pretraining on web data.
arXiv Detail & Related papers (2023-08-14T13:53:18Z) - Skill-it! A Data-Driven Skills Framework for Understanding and Training
Language Models [29.17711426767209]
We study how to best select data that leads to good downstream model performance across tasks.
We develop a new framework based on a simple hypothesis: just as humans acquire interdependent skills in a deliberate order, language models also follow a natural order when learning a set of skills from their training data.
arXiv Detail & Related papers (2023-07-26T18:01:49Z) - SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language
Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity in a 1.3B-parameter GPT-3 XL model, resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z) - Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z) - How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We use RoBERTa to study the impact of pretraining data size on the syntactic knowledge of the models.
arXiv Detail & Related papers (2021-09-07T15:51:39Z) - How Well Self-Supervised Pre-Training Performs with Streaming Data? [73.5362286533602]
In real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming.
It is unclear how well sequential self-supervised pre-training performs with streaming data.
We find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild.
arXiv Detail & Related papers (2021-04-25T06:56:48Z) - Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
To our great astonishment, we uncover that pre-training on certain non-human-language data gives GLUE performance close to that of models pre-trained on another non-English language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)