Improving Length-Generalization in Transformers via Task Hinting
- URL: http://arxiv.org/abs/2310.00726v1
- Date: Sun, 1 Oct 2023 16:57:40 GMT
- Title: Improving Length-Generalization in Transformers via Task Hinting
- Authors: Pranjal Awasthi and Anupam Gupta
- Abstract summary: The performance of a transformer model trained on tasks up to a certain length drops sharply when applied to longer instances of the same problem.
This work proposes an approach based on task hinting towards addressing length generalization.
- Score: 42.95479331339189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It has been observed in recent years that transformers have problems with
length generalization for certain types of reasoning and arithmetic tasks. In
particular, the performance of a transformer model trained on tasks (say
addition) up to a certain length (e.g., 5 digit numbers) drops sharply when
applied to longer instances of the same problem. This work proposes an approach
based on task hinting towards addressing length generalization. Our key idea is
that while training the model on task-specific data, it is helpful to
simultaneously train the model to solve a simpler but related auxiliary task as
well.
We study the classical sorting problem as a canonical example to evaluate our
approach. We design a multitask training framework and show that task hinting
significantly improves length generalization. For sorting, we show that it is
possible to train models on data consisting of sequences having length at most
$20$, and improve the test accuracy on sequences of length $100$ from less than
1% (for standard training) to more than 92% (via task hinting).
Our study uncovers several interesting aspects of length generalization. We
observe that while several auxiliary tasks may seem natural a priori, their
effectiveness in improving length generalization differs dramatically. We
further use probing and visualization-based techniques to understand the
internal mechanisms via which the model performs the task, and propose a
theoretical construction consistent with the observed learning behaviors of the
model. Based on our construction, we show that introducing a small number of
length-dependent parameters into the training procedure can further boost the
performance on unseen lengths. Finally, we also show the efficacy of our task
hinting based approach beyond sorting, giving hope that these techniques will
be applicable in broader contexts.
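The abstract describes task hinting only at a high level: while training on the main task, the model is simultaneously trained to solve a simpler, related auxiliary task. The sketch below shows one way such a multitask setup could look, assuming PyTorch; the auxiliary "hint" task used here (flagging the minimum element), the tiny encoder, the learned positional embeddings, and the loss weight `aux_weight` are illustrative assumptions, not the paper's actual choices.

```python
import torch
import torch.nn as nn

VOCAB = 50           # assumption: tokens are small integers 0..49
MAX_TRAIN_LEN = 20   # train on sequences of length <= 20, as in the paper
MAX_EVAL_LEN = 128   # covers evaluation at length 100 (positional scheme is a placeholder)

class HintedSorter(nn.Module):
    """Tiny encoder with two heads: a main sorting head and an auxiliary
    'hint' head. The hint task here (mark the minimum element) is a
    hypothetical stand-in for the paper's auxiliary tasks."""
    def __init__(self, d_model=64, nhead=4, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(MAX_EVAL_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.sort_head = nn.Linear(d_model, VOCAB)  # position i predicts the i-th smallest value
        self.hint_head = nn.Linear(d_model, 2)      # position i predicts: is x[i] the minimum?

    def forward(self, x):
        positions = torch.arange(x.size(1), device=x.device)
        h = self.encoder(self.embed(x) + self.pos(positions))
        return self.sort_head(h), self.hint_head(h)

def make_batch(batch=32, max_len=MAX_TRAIN_LEN):
    length = torch.randint(2, max_len + 1, (1,)).item()
    x = torch.randint(0, VOCAB, (batch, length))
    y_sort = torch.sort(x, dim=1).values                      # main target: the sorted sequence
    y_hint = (x == x.min(dim=1, keepdim=True).values).long()  # hint target: minimum flags
    return x, y_sort, y_hint

model = HintedSorter()
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
ce = nn.CrossEntropyLoss()
aux_weight = 0.5  # assumption: relative weight of the hint loss

for step in range(200):  # short demo loop, not the paper's training schedule
    x, y_sort, y_hint = make_batch()
    logits_sort, logits_hint = model(x)
    loss = ce(logits_sort.transpose(1, 2), y_sort) \
         + aux_weight * ce(logits_hint.transpose(1, 2), y_hint)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Length generalization would then be measured by evaluating only the sorting head on sequences of length 100, far beyond the training lengths; this sketch makes no claim to reproduce the abstract's reported jump from under 1% to over 92% accuracy.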
Related papers
- Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training.
In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z)
- Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
Large language models (LLMs) have sparked debate over whether they genuinely generalize to unseen tasks or rely on memorizing vast amounts of pretraining data.
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
- What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
- Teaching Arithmetic to Small Transformers [39.72665384986095]
This study investigates how small transformers can efficiently learn arithmetic operations.
We first demonstrate that conventional training data is not the most effective for arithmetic learning.
We then train on chain-of-thought style data that includes intermediate step results.
arXiv Detail & Related papers (2023-07-07T04:33:31Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- Explaining the Effectiveness of Multi-Task Learning for Efficient Knowledge Extraction from Spine MRI Reports [2.5953185061765884]
We show that a single multi-tasking model can match the performance of task specific models.
We validate our observations on our internal radiologist-annotated datasets on the cervical and lumbar spine.
arXiv Detail & Related papers (2022-05-06T01:51:19Z)
- Sequence Length is a Domain: Length-based Overfitting in Transformer Models [0.0]
In machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches.
We show that the observed drop in performance is due to the hypothesis length corresponding to the lengths seen by the model during training rather than the length of the input sequence.
arXiv Detail & Related papers (2021-09-15T13:25:19Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)
- Temporally Correlated Task Scheduling for Sequence Learning [143.70523777803723]
In many applications, a sequence learning task is usually associated with multiple temporally correlated auxiliary tasks.
We introduce a learnable scheduler to sequence learning, which can adaptively select auxiliary tasks for training.
Our method significantly improves the performance of simultaneous machine translation and stock trend forecasting.
arXiv Detail & Related papers (2020-07-10T10:28:54Z)
- Generalized Hindsight for Reinforcement Learning [154.0545226284078]
We argue that low-reward data collected while trying to solve one task provides little to no signal for solving that particular task.
We present Generalized Hindsight: an approximate inverse reinforcement learning technique for relabeling behaviors with the right tasks.
arXiv Detail & Related papers (2020-02-26T18:57:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.