Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in
Transformer Models
- URL: http://arxiv.org/abs/2311.00871v1
- Date: Wed, 1 Nov 2023 21:41:08 GMT
- Title: Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in
Transformer Models
- Authors: Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni
- Abstract summary: Transformer models have the remarkable ability to perform in-context learning (ICL).
We study how effectively transformers can bridge between the distinct task families in their pretraining data mixture to identify and learn new tasks in-context.
Our results highlight that the impressive ICL abilities of high-capacity sequence models may be more closely tied to the coverage of their pretraining data mixtures than to inductive biases.
- Score: 9.340409961107955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models, notably large language models (LLMs), have the remarkable
ability to perform in-context learning (ICL) -- to perform new tasks when
prompted with unseen input-output examples without any explicit model training.
In this work, we study how effectively transformers can bridge between their
pretraining data mixture, comprised of multiple distinct task families, to
identify and learn new tasks in-context which are both inside and outside the
pretraining distribution. Building on previous work, we investigate this
question in a controlled setting, where we study transformer models trained on
sequences of $(x, f(x))$ pairs rather than natural language. Our empirical
results show transformers demonstrate near-optimal unsupervised model selection
capabilities, in their ability to first in-context identify different task
families and in-context learn within them when the task families are
well-represented in their pretraining data. However, when presented with tasks
or functions which are out-of-domain of their pretraining data, we demonstrate
various failure modes of transformers and degradation of their generalization
for even simple extrapolation tasks. Together, our results highlight that the
impressive ICL abilities of high-capacity sequence models may be more closely
tied to the coverage of their pretraining data mixtures than to inductive biases
that create fundamental generalization capabilities.
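To make the controlled setting described in the abstract concrete, the snippet below is a minimal, illustrative sketch (not the authors' code) of how pretraining sequences of $(x, f(x))$ pairs could be drawn from a mixture of task families. The two families used here (scalar linear functions and sinusoids), the mixture weights, and all names and hyperparameters are assumptions for illustration; the paper's actual task families and sampling details may differ.

```python
import numpy as np

def sample_task(rng, family):
    """Draw one function f from the named task family (illustrative choices)."""
    if family == "linear":
        w = float(rng.normal())                  # f(x) = w * x
        return lambda x: w * x
    if family == "sinusoid":
        a, b = rng.uniform(0.5, 2.0, size=2)     # f(x) = a * sin(b * x)
        return lambda x: a * np.sin(b * x)
    raise ValueError(f"unknown task family: {family}")

def sample_sequence(rng, mixture, seq_len=32):
    """Build one pretraining sequence of interleaved (x, f(x)) pairs.

    `mixture` maps family name -> sampling weight; the coverage of this mixture
    is what the paper argues governs the model's downstream ICL behavior.
    """
    families = list(mixture)
    probs = np.array([mixture[f] for f in families], dtype=float)
    family = rng.choice(families, p=probs / probs.sum())
    f = sample_task(rng, family)
    xs = rng.uniform(-1.0, 1.0, size=seq_len)
    ys = np.array([f(x) for x in xs])
    # Interleave as (x_1, f(x_1), x_2, f(x_2), ...): a decoder-only model trained
    # to predict each f(x_i) from the preceding pairs is doing in-context regression.
    return np.stack([xs, ys], axis=1).reshape(-1)

rng = np.random.default_rng(0)
batch = [sample_sequence(rng, {"linear": 0.5, "sinusoid": 0.5}) for _ in range(4)]
print(batch[0][:6])   # first three (x, f(x)) pairs of one pretraining sequence
```

Evaluating a model trained on such sequences on prompts drawn from held-out function families (e.g., convex combinations of the two families above) is one way to probe the in-distribution versus out-of-distribution behavior the abstract describes.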
Related papers
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task given a few examples during inference.
This paper investigates the training dynamics of transformers trained by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Pre-trained Language Models Improve the Few-shot Prompt Ability of Decision Transformer [10.338170161831496]
Decision Transformer (DT) has emerged as a promising class of algorithms in offline reinforcement learning (RL) tasks.
We introduce the Language model-initialized Prompt Decision Transformer (LPDT), which leverages pre-trained language models for meta-RL tasks and fine-tunes the model using Low-rank Adaptation (LoRA).
Our approach integrates pre-trained language models and RL tasks seamlessly.
arXiv Detail & Related papers (2024-08-02T17:25:34Z)
- In-Context Learning for MIMO Equalization Using Transformer-Based Sequence Models [44.161789477821536]
Large pre-trained sequence models have the capacity to carry out in-context learning (ICL).
In ICL, a decision on a new input is made via a direct mapping from the input, together with a few examples from the given task, to the output.
We demonstrate via numerical results that transformer-based ICL has a threshold behavior.
arXiv Detail & Related papers (2023-11-10T15:09:04Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression (a minimal sketch of this setup follows the entry).
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
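The sketch below illustrates the single-layer linear attention setup referenced in the entry above. It is an illustrative construction under simple assumptions (tokens formed as $(x_i, y_i)$ pairs, a zero placeholder label on the query token, isotropic Gaussian inputs, and hand-picked value and key-query matrices), not the paper's parameterization or code.

```python
import numpy as np

def make_prompt(rng, d=5, n=20):
    """One in-context linear-regression task: n labelled examples plus a query."""
    w = rng.normal(size=d)                        # the task's hidden regression weights
    X = rng.normal(size=(n + 1, d))               # last row is the query input x_q
    y = X @ w
    Z = np.concatenate([X, y[:, None]], axis=1)   # tokens z_i = (x_i, y_i)
    Z[-1, -1] = 0.0                               # query token gets a 0 placeholder label
    return Z, float(y[-1])                        # prompt matrix and the target label for x_q

def linear_attention_predict(Z, V, Q):
    """Single-layer linear self-attention (no softmax); prediction read off the query token.

    V and Q stand in for the value and combined key-query matrices that pretraining
    over many random tasks would learn; both have shape (d+1, d+1).
    """
    n = Z.shape[0] - 1
    ctx, zq = Z[:n], Z[-1]
    attn = ctx @ Q @ zq            # linear attention scores z_i^T Q z_q
    values = ctx @ V.T             # rows are V z_i
    # y_hat = (1/n) * sum_i last_coordinate(V z_i) * (z_i^T Q z_q)
    return float(values[:, -1] @ attn / n)

rng = np.random.default_rng(0)
d = 5
Z, y_true = make_prompt(rng, d=d)
# Hand-picked parameters for which the layer computes y_hat = x_q . ((1/n) sum_i y_i x_i),
# i.e. one step of in-context least squares on the prompt.
V = np.zeros((d + 1, d + 1)); V[-1, -1] = 1.0
Q = np.zeros((d + 1, d + 1)); Q[:d, :d] = np.eye(d)
print(linear_attention_predict(Z, V, Q), y_true)
```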
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action given a query state and an in-context dataset of interactions (a toy sketch of this data format follows the entry).
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
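As a rough illustration of the supervised pretraining format described for DPT, the toy snippet below builds one (in-context interaction dataset, optimal-action label) training example in a multi-armed bandit. The bandit setting, the uniform behavior policy, and all names are illustrative assumptions rather than the paper's actual construction.

```python
import numpy as np

def dpt_training_example(rng, n_arms=5, context_len=20):
    """Build one toy supervised-pretraining example in a multi-armed bandit.

    The sequence model would be trained to map (in-context interaction dataset,
    query) -> optimal action; in a bandit there is a single state, so the query
    is trivial. All specifics here are illustrative, not the paper's.
    """
    means = rng.normal(size=n_arms)                    # the task: arm reward means
    arms = rng.integers(0, n_arms, size=context_len)   # uniform behavior policy
    rewards = rng.normal(loc=means[arms])              # noisy observed rewards
    context = np.stack([arms, rewards], axis=1)        # (action, reward) tokens
    optimal_action = int(np.argmax(means))             # supervised label
    return context, optimal_action

rng = np.random.default_rng(0)
context, label = dpt_training_example(rng)
print(context[:3], label)   # a few in-context interactions and the target action
```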
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Concept-aware Training Improves In-context Learning Ability of Language Models [0.0]
Many recent language models (LMs) of the Transformer family exhibit the so-called in-context learning (ICL) ability.
We propose a method to create LMs able to better utilize the in-context information.
We find that the data sampling of Concept-aware Training consistently improves models' reasoning ability.
arXiv Detail & Related papers (2023-05-23T07:44:52Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Pretrained Transformers as Universal Computation Engines [105.00539596788127]
We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning.
We study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction.
We find that such pretraining enables the resulting Frozen Pretrained Transformer (FPT) to generalize zero-shot to these modalities, matching the performance of a transformer fully trained on these tasks.
arXiv Detail & Related papers (2021-03-09T06:39:56Z)
- End-to-end spoken language understanding using transformer networks and self-supervised pre-trained features [17.407912171579852]
Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in the field of natural language processing (NLP).
We introduce a modular End-to-End (E2E) spoken language understanding (SLU) architecture based on transformer networks, which allows the use of self-supervised pre-trained acoustic features.
arXiv Detail & Related papers (2020-11-16T19:30:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.