Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of
Pre-trained Models' Transferability
- URL: http://arxiv.org/abs/2103.07162v1
- Date: Fri, 12 Mar 2021 09:19:14 GMT
- Title: Is BERT a Cross-Disciplinary Knowledge Learner? A Surprising Finding of
Pre-trained Models' Transferability
- Authors: Wei-Tsung Kao, Hung-Yi Lee
- Abstract summary: We investigate whether the power of the models pre-trained on text data, such as BERT, can be transferred to general token sequence classification applications.
We find that even on non-text data, the models pre-trained on text converge faster than randomly initialized models.
- Score: 74.11825654535895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate whether the power of the models pre-trained on
text data, such as BERT, can be transferred to general token sequence
classification applications. To verify pre-trained models' transferability, we
test the pre-trained models on (1) text classification tasks in which the
meanings of the tokens are mismatched, and (2) real-world non-text token sequence
classification data, including amino acid sequences, DNA sequences, and music. We
find that even on non-text data, the models pre-trained on text converge faster
than randomly initialized models, and the testing performance of the pre-trained
models is only slightly worse than that of the models designed for the specific
tasks.
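
Below is a minimal sketch, assuming the Hugging Face transformers and PyTorch
libraries (not the authors' released code), of the comparison the abstract
describes: fine-tuning a text-pre-trained BERT versus a randomly initialized BERT
of the same architecture on a non-text token sequence classification task. The
toy DNA-style sequences, labels, and nucleotide-to-token mapping are illustrative
assumptions.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Map each nucleotide to an arbitrary existing BERT vocabulary token; the paper's
# point is that the tokens' meanings need not match the pre-training text.
nucleotide_to_token = {"A": "a", "C": "c", "G": "g", "T": "t"}

def encode(seq: str):
    tokens = [nucleotide_to_token[ch] for ch in seq]
    return tokenizer(" ".join(tokens), return_tensors="pt",
                     padding="max_length", truncation=True, max_length=32)

# Toy labeled DNA-style sequences (purely illustrative).
data = [("ACGTACGT", 0), ("TTGGAACC", 1)]

def make_model(pretrained: bool):
    if pretrained:
        return BertForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=2)
    config = BertConfig.from_pretrained("bert-base-uncased", num_labels=2)
    return BertForSequenceClassification(config)  # randomly initialized baseline

for pretrained in (True, False):
    model = make_model(pretrained)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for seq, label in data:
        batch = encode(seq)
        batch["labels"] = torch.tensor([label])
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f"pretrained={pretrained}: final training loss {loss.item():.4f}")
```

In the paper's setting, the pre-trained model is expected to converge faster than
the randomly initialized one under the same training budget.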
Related papers
- Self-Supervised Representation Learning for Online Handwriting Text
Classification [0.8594140167290099]
We propose the novel Part of Stroke Masking (POSM) as a pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese languages.
To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods.
The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification.
arXiv Detail & Related papers (2023-10-10T14:07:49Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial
Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Momentum Calibration for Text Generation [86.58432361938806]
We propose MoCa (Momentum Calibration) for text generation.
MoCa is an online method that dynamically generates slowly evolving (but consistent) samples using a momentum moving average generator with beam search.
arXiv Detail & Related papers (2022-12-08T13:12:10Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Deep Ensembles for Low-Data Transfer Learning [21.578470914935938]
We study different ways of creating ensembles from pre-trained models.
We show that the nature of pre-training itself is a performant source of diversity.
We propose a practical algorithm that efficiently identifies a subset of pre-trained models for any downstream dataset.
arXiv Detail & Related papers (2020-10-14T07:59:00Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
- Data Augmentation using Pre-trained Transformer Models [2.105564340986074]
We study different types of transformer based pre-trained models such as auto-regressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART) for conditional data augmentation.
We show that prepending the class labels to text sequences provides a simple yet effective way to condition the pre-trained models for data augmentation (see the sketch after this list).
arXiv Detail & Related papers (2020-03-04T18:35:19Z)
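
The following is a minimal sketch, assuming GPT-2 via the Hugging Face
transformers library, of the label-prepending conditioning idea summarized in the
Data Augmentation entry above. The separator string, example sentences, and class
labels are illustrative assumptions rather than the paper's exact setup, and the
fine-tuning loop itself is elided.

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Fine-tuning corpus: each line starts with its class label (the conditioning
# signal), followed by a separator and the original text.
labeled_examples = [
    ("positive", "the movie was a delight from start to finish"),
    ("negative", "the plot dragged and the acting felt flat"),
]
train_lines = [f"{label} <SEP> {text}" for label, text in labeled_examples]
# ... fine-tune the LM on train_lines with a standard causal-LM objective ...

# After fine-tuning, prompt with a label to sample a synthetic example of that class.
prompt = tokenizer("positive <SEP>", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=20, do_sample=True, top_k=50,
                           pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```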
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.