Pretrained Transformers as Universal Computation Engines
- URL: http://arxiv.org/abs/2103.05247v1
- Date: Tue, 9 Mar 2021 06:39:56 GMT
- Title: Pretrained Transformers as Universal Computation Engines
- Authors: Kevin Lu, Aditya Grover, Pieter Abbeel, Igor Mordatch
- Abstract summary: We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
- Score: 105.00539596788127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the capability of a transformer pretrained on natural language
to generalize to other modalities with minimal finetuning -- in particular,
without finetuning of the self-attention and feedforward layers of the residual
blocks. We consider such a model, which we call a Frozen Pretrained Transformer
(FPT), and study finetuning it on a variety of sequence classification tasks
spanning numerical computation, vision, and protein fold prediction. In
contrast to prior works which investigate finetuning on the same modality as
the pretraining dataset, we show that pretraining on natural language improves
performance and compute efficiency on non-language downstream tasks. In
particular, we find that such pretraining enables FPT to generalize in
zero-shot to these modalities, matching the performance of a transformer fully
trained on these tasks.
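A minimal sketch of the FPT setup described in the abstract, assuming a GPT-2 backbone from the Hugging Face transformers library and an illustrative bit-sequence classification task: the self-attention and feedforward blocks stay frozen, while a new input projection and output head (and, as in the paper, the layer norms and positional embeddings) remain trainable.
```python
# Sketch of a Frozen Pretrained Transformer (FPT): freeze the self-attention and
# feedforward blocks of a language-pretrained GPT-2, and finetune only a new
# input projection, a new output head, the layer norms, and the positional
# embeddings. Backbone choice and task are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import GPT2Model


class FrozenPretrainedTransformer(nn.Module):
    def __init__(self, input_dim: int, num_classes: int):
        super().__init__()
        self.backbone = GPT2Model.from_pretrained("gpt2")
        hidden = self.backbone.config.n_embd  # 768 for the base GPT-2

        # Freeze everything, then re-enable the small set of finetuned parameters.
        for p in self.backbone.parameters():
            p.requires_grad = False
        for name, p in self.backbone.named_parameters():
            if "ln" in name or "wpe" in name:  # layer norms, positional embeddings
                p.requires_grad = True

        # Modality-specific input projection and output head, always trained.
        self.input_proj = nn.Linear(input_dim, hidden)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):  # x: (batch, seq_len, input_dim)
        h = self.backbone(inputs_embeds=self.input_proj(x)).last_hidden_state
        return self.head(h[:, -1])  # classify from the last position


# Example: binary classification of length-8 bit sequences (a stand-in for the
# paper's bit-level tasks; the data here is random and purely illustrative).
model = FrozenPretrainedTransformer(input_dim=1, num_classes=2)
bits = torch.randint(0, 2, (4, 8, 1)).float()
logits = model(bits)  # shape (4, 2)
```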
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized (a setup sketched below).
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
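A minimal sketch of the setup in the Random Transformers entry above, assuming a small PyTorch encoder and an illustrative sequence-classification head: the transformer body keeps its random initialization, and only the embedding layers (input embedding and output head) receive gradients.
```python
# Sketch: a randomly initialized transformer in which only the embedding
# layers (input embedding + output head) are optimized; the transformer
# blocks keep their random weights. Model sizes and task are illustrative.
import torch
import torch.nn as nn


class RandomTransformer(nn.Module):
    def __init__(self, vocab_size=16, d_model=128, num_layers=4, num_classes=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

        # Freeze the randomly initialized transformer body.
        for p in self.body.parameters():
            p.requires_grad = False

    def forward(self, tokens):  # tokens: (batch, seq_len)
        return self.head(self.body(self.embed(tokens))[:, -1])


model = RandomTransformer()
# Only the embedding and head parameters are handed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```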
- Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models [9.340409961107955]
Transformer models have the remarkable ability to perform in-context learning (ICL).
We study how effectively transformers can bridge between the task families in their pretraining data mixture to identify and learn new tasks in-context.
Our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases.
arXiv Detail & Related papers (2023-11-01T21:41:08Z)
- Uncovering mesa-optimization algorithms in Transformers [61.06055590704677]
Some autoregressive models can learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so.
We show that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed.
Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
arXiv Detail & Related papers (2023-09-11T22:42:50Z)
- Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers [1.9662978733004601]
Self-supervised pre-training of large-scale transformer models on text corpora, followed by finetuning, has achieved state-of-the-art results on a number of natural language processing tasks.
In our work, we find that the recently reported result that frozen language-pretrained transformers transfer well to other modalities is, in fact, an artifact of not tuning the learning rates (a minimal sweep along these lines is sketched below).
arXiv Detail & Related papers (2021-07-26T20:20:48Z)
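A sketch of the kind of learning-rate sweep the entry above argues for, run separately for the frozen-pretrained and fully-finetuned variants before comparing them; `build_model` and `train_and_evaluate` are hypothetical helpers standing in for a real training setup, not functions from the paper.
```python
# Sketch: sweep learning rates for both the frozen-pretrained and the
# fully-finetuned transformer before comparing them, since cross-modal
# transfer results can hinge on this choice. `build_model` and
# `train_and_evaluate` are hypothetical stand-ins for a real training loop.
learning_rates = [3e-5, 1e-4, 3e-4, 1e-3, 3e-3]
results = {}

for variant in ("frozen_pretrained", "fully_finetuned"):
    best = None
    for lr in learning_rates:
        model = build_model(variant)                  # hypothetical factory
        accuracy = train_and_evaluate(model, lr=lr)   # hypothetical trainer
        if best is None or accuracy > best[1]:
            best = (lr, accuracy)
    results[variant] = best

for variant, (lr, acc) in results.items():
    print(f"{variant}: best lr={lr:g}, accuracy={acc:.3f}")
```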
- Consistent Accelerated Inference via Confident Adaptive Transformers [29.034390810078172]
We develop a novel approach for confidently accelerating inference in large and expensive multilayer Transformers.
We increase computational efficiency while guaranteeing, with high confidence, a specifiable degree of consistency with the original model.
We demonstrate the effectiveness of this approach on four classification and regression tasks.
arXiv Detail & Related papers (2021-04-18T10:22:28Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart; a generic linear-attention recurrence of this kind is sketched below.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
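A sketch of the generic linear-attention recurrence that conversions like the one in the Finetuning-into-RNNs entry above build on: replacing softmax attention with a feature map phi lets causal attention be accumulated in a fixed-size state. The feature map (elu + 1) and shapes here are illustrative assumptions, not the paper's specific conversion procedure.
```python
# Sketch: linear attention computed as an RNN. Replacing softmax(q.k) with a
# feature map phi lets attention be accumulated in a fixed-size state (S, z),
# giving O(1) memory per generated token. phi(x) = elu(x) + 1 is one common
# choice; this is a generic illustration, not the paper's exact method.
import torch
import torch.nn.functional as F


def phi(x):
    return F.elu(x) + 1.0


def linear_attention_rnn(q, k, v):
    """q, k, v: (seq_len, dim). Returns causal attention outputs, (seq_len, dim)."""
    dim = q.shape[-1]
    S = torch.zeros(dim, dim)   # running sum of phi(k) v^T
    z = torch.zeros(dim)        # running sum of phi(k)
    outputs = []
    for t in range(q.shape[0]):
        qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
        S = S + torch.outer(kt, vt)
        z = z + kt
        outputs.append((qt @ S) / (qt @ z + 1e-6))
    return torch.stack(outputs)


out = linear_attention_rnn(torch.randn(5, 8), torch.randn(5, 8), torch.randn(5, 8))
```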
- End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [17.407912171579852]
Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP).
We introduce a modular End-to-End (E2E) SLU architecture based on transformer networks, which allows the use of self-supervised pre-trained acoustic features.
arXiv Detail & Related papers (2020-11-16T19:30:52Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage (the tied-versus-decoupled choice is sketched below).
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
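A minimal sketch of the coupling choice discussed in the embedding-coupling entry above: a toy model whose output projection either shares the input embedding matrix (tied) or owns a separate one (decoupled). Sizes are illustrative, and the transformer body is omitted to keep the focus on the embeddings.
```python
# Sketch: tied vs. decoupled input/output embeddings in a toy language model.
# With tied weights the output projection reuses the input embedding matrix;
# decoupling gives the two matrices independent parameters, at the cost of a
# larger parameter count. (Transformer body omitted; only coupling is shown.)
import torch
import torch.nn as nn


class ToyLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=64, tie_embeddings=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.output = nn.Linear(d_model, vocab_size, bias=False)
        if tie_embeddings:
            self.output.weight = self.embed.weight  # share a single matrix

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        return self.output(self.embed(tokens))  # logits over the vocabulary


tied = sum(p.numel() for p in ToyLM(tie_embeddings=True).parameters())
decoupled = sum(p.numel() for p in ToyLM(tie_embeddings=False).parameters())
print(tied, decoupled)  # the decoupled variant carries an extra embedding matrix
```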
- Multi-Stage Influence Function [97.19210942277354]
We develop a multi-stage influence function score to track predictions from a finetuned model all the way back to the pretraining data.
We study two different scenarios with the pretrained embeddings fixed or updated in the finetuning tasks.
arXiv Detail & Related papers (2020-07-17T16:03:11Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.