What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks
- URL: http://arxiv.org/abs/2404.01601v1
- Date: Tue, 2 Apr 2024 02:45:12 GMT
- Title: What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks
- Authors: Xingwu Chen, Difan Zou
- Abstract summary: We show that a transformer with only one attention layer excels at memorization but falls short on other tasks.
We identify a class of simple operations that a single attention layer can execute, and show that complex tasks can be approached as combinations of these simple operations.
- Score: 15.874604623294427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the capabilities of the transformer architecture with varying depth. Specifically, we design a novel set of sequence learning tasks to systematically evaluate and understand how the depth of a transformer affects its ability to perform memorization, reasoning, generalization, and contextual generalization. We show that a transformer with only one attention layer excels at memorization but falls short on other tasks. We then show that exhibiting reasoning and generalization ability requires the transformer to have at least two attention layers, while contextual generalization may necessitate three attention layers. Additionally, we identify a class of simple operations that a single attention layer can execute, and show that complex tasks can be approached as combinations of these simple operations and thus can be resolved by stacking multiple attention layers. This sheds light on studying more practical and complex tasks beyond our design. Numerical experiments corroborate our theoretical findings.
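To make the object under study concrete, here is a minimal sketch of an attention-only transformer whose depth is an explicit parameter, so one can stack one, two, or three attention layers as in the paper's case studies. The dimensions, random weights, residual connections, and omission of MLP blocks are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch: an attention-only transformer with configurable depth,
# for probing how stacking softmax-attention layers composes operations.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, Wq, Wk, Wv):
    """One softmax-attention layer with a residual connection."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (T, T) attention map
    return X + A @ V  # residual: each layer contributes one "simple operation"

def transformer(X, params):
    """Stack len(params) attention layers; depth controls expressivity."""
    for Wq, Wk, Wv in params:
        X = attention_layer(X, Wq, Wk, Wv)
    return X

rng = np.random.default_rng(0)
T, d, depth = 8, 16, 3  # three layers, the deepest regime the paper considers
params = [tuple(rng.normal(0, 0.1, (d, d)) for _ in range(3)) for _ in range(depth)]
Y = transformer(rng.normal(size=(T, d)), params)
print(Y.shape)  # (8, 16)
```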
Related papers
- Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models [12.112842686827669]
Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks.
We investigate the layerwise behavior of Transformers to uncover the mechanisms underlying their multi-task generalization ability.
Our explicit constructions align closely with empirical observations, providing theoretical support for the Transformer's effectiveness and efficiency on sequence learning across diverse tasks.
arXiv Detail & Related papers (2025-06-02T17:39:31Z)
- How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check'.
Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
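For concreteness, here is a minimal sketch of that one-layer architecture (one softmax-attention layer followed by a linear read-out) applied to a binary token sequence. The embeddings, dimensions, and random weights are illustrative assumptions, and no training loop is shown.

```python
# Sketch of a one-layer transformer: attention layer + linear read-out.
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def one_layer_transformer(seq, E, Wq, Wk, Wv, w):
    X = E[seq]  # look up token embeddings (rows of E indexed by 0/1 tokens)
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(Wk.shape[1]))
    H = A @ (X @ Wv)   # single attention layer
    return H[-1] @ w   # linear layer reads out at the last position

rng = np.random.default_rng(0)
d = 8
E, Wq, Wk, Wv, w = (rng.normal(size=s) for s in [(2, d), (d, d), (d, d), (d, d), (d,)])
print(one_layer_transformer(np.array([0, 1, 1, 0, 1]), E, Wq, Wk, Wv, w))
```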
arXiv Detail & Related papers (2025-05-02T00:07:35Z)
- Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers [0.0]
I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers.
I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning.
I find that successful learning occurs only under the presence of an implicit curriculum.
arXiv Detail & Related papers (2024-11-18T23:12:13Z)
- One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers to learn the one-nearest-neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest-neighbor predictor.
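A hedged sketch of that mechanism: with a sufficiently sharp softmax, a single attention layer concentrates nearly all of its weight on the closest stored example, so its output reproduces that example's label. The temperature, data, and distance-based scores below are my assumptions, not the paper's construction.

```python
# Sharp softmax attention approximates one-nearest-neighbor retrieval.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_1nn(x_query, X_ctx, y_ctx, beta=50.0):
    # Scores proportional to negative squared distance (dot-product attention
    # can realize this up to key-norm terms); large beta sharpens the softmax
    # toward the argmax, i.e. the nearest neighbor.
    scores = -beta * np.sum((X_ctx - x_query) ** 2, axis=1)
    return softmax(scores) @ y_ctx

rng = np.random.default_rng(1)
X_ctx = rng.normal(size=(10, 4))
y_ctx = rng.normal(size=10)
x = X_ctx[3] + 0.01 * rng.normal(size=4)       # query close to example 3
print(attention_1nn(x, X_ctx, y_ctx), y_ctx[3])  # nearly identical values
```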
arXiv Detail & Related papers (2024-11-16T16:12:42Z)
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task from just a few examples provided at inference time.
This paper investigates the training dynamics of transformers by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
- Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers [16.26331213222281]
We analyze the solutions that simple transformer blocks implement when tackling the histogram task.
This task reveals a complex interplay between predictive performance, vocabulary and embedding sizes, token-mixing mechanisms, and feed-forward layer capacity.
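As a small sketch of the histogram task itself (count, for each position, how often its token appears in the sequence) and of an attention-style computation that solves it: hard attention over exact token matches, with a value vector of ones, turns attention into counting. The formulation is my illustrative reading, not the paper's construction.

```python
# The histogram task and a match-and-sum "attention" solution.
import numpy as np

def histogram_targets(tokens):
    tokens = np.asarray(tokens)
    return (tokens[:, None] == tokens[None, :]).sum(axis=1)

def match_attention_counts(tokens):
    tokens = np.asarray(tokens)
    M = tokens[:, None] == tokens[None, :]  # hard "attention" on exact matches
    return M @ np.ones(len(tokens))         # value vector of ones => counts

seq = [2, 7, 2, 2, 5, 7]
print(histogram_targets(seq))       # [3 2 3 3 1 2]
print(match_attention_counts(seq))  # same counts via match-and-sum
```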
arXiv Detail & Related papers (2024-07-16T09:48:10Z)
- Exploring Concept Depth: How Large Language Models Acquire Knowledge at Different Layers? [57.04803703952721]
Large language models (LLMs) have shown remarkable performances across a wide range of tasks.
However, the mechanisms by which these models encode tasks of varying complexities remain poorly understood.
We introduce the idea of "Concept Depth" to suggest that more complex concepts are typically acquired in deeper layers.
arXiv Detail & Related papers (2024-04-10T14:56:40Z)
- Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding [52.723297744257536]
Pre-trained language models (LMs) have shown effectiveness in scientific literature understanding tasks.
We propose a multi-task contrastive learning framework, SciMult, to facilitate common knowledge sharing across different literature understanding tasks.
arXiv Detail & Related papers (2023-05-23T16:47:22Z)
- Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
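As a toy illustration of that decomposition idea (my own sketch, not the authors' trained solution): a hard attention map that is a permutation matrix implements one "copy" operation, and stacking two such layers composes the operations.

```python
# Two hard-attention "copy" operations compose when stacked.
import numpy as np

def apply_attention(A, X):
    return A @ X  # attention map applied to values

T = 5
X = np.arange(T, dtype=float)[:, None]    # tokens 0..4 as 1-d "embeddings"
shift = np.roll(np.eye(T), 1, axis=1)     # layer 1: attend one position ahead
reverse = np.eye(T)[::-1]                 # layer 2: attend to mirrored position
Y = apply_attention(reverse, apply_attention(shift, X))
print(Y.ravel())  # [0. 4. 3. 2. 1.] -- the composition of the two copies
```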
arXiv Detail & Related papers (2022-10-02T00:46:36Z)
- Unveiling Transformers with LEGO: a synthetic reasoning task [23.535488809197787]
We study how the transformer architecture learns to follow a chain of reasoning.
In some data regimes, the trained transformer finds "shortcut" solutions to follow the chain of reasoning.
We find that such shortcuts can be prevented with appropriate architecture modifications or careful data preparation.
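A small sketch of a LEGO-style chain-of-reasoning instance, as I read the abstract: each variable is plus or minus the previous one, and the model must resolve every variable's value by following the chain. The exact task format below is an assumption.

```python
# Generate a toy LEGO-style reasoning chain with ground-truth values.
import numpy as np

def lego_chain(n, rng):
    signs = rng.choice([1, -1], size=n)
    values = np.cumprod(signs)  # v_i = s_i * v_{i-1}, with v_0 = s_0
    clauses = [f"a0 = {'+' if signs[0] == 1 else '-'}1"]
    clauses += [f"a{i} = {'+' if signs[i] == 1 else '-'}a{i-1}" for i in range(1, n)]
    return "; ".join(clauses), values

chain, values = lego_chain(5, np.random.default_rng(3))
print(chain)   # e.g. "a0 = +1; a1 = -a0; a2 = +a1; ..."
print(values)  # ground-truth value of each variable along the chain
```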
arXiv Detail & Related papers (2022-06-09T06:30:17Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Evolving Attention with Residual Convolutions [29.305149185821882]
We propose a novel mechanism based on evolving attention to improve the performance of transformers.
The proposed attention mechanism achieves significant performance improvement over various state-of-the-art models for multiple tasks.
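A hedged sketch of the evolving-attention idea as I read the abstract: the attention map from the previous layer is refined by a small convolution and mixed residually into the current layer's raw scores. The kernel size, mixing weight, and shapes are my assumptions.

```python
# Evolve attention maps across layers via a residual 3x3 convolution.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv3x3(A, K):
    """Plain 3x3 convolution with zero padding over an attention map."""
    P = np.pad(A, 1)
    H, W = A.shape
    return sum(K[i, j] * P[i:i + H, j:j + W] for i in range(3) for j in range(3))

def evolved_attention(scores, A_prev, kernel, alpha=0.5):
    # Residually mix the convolved previous map into this layer's scores.
    return softmax(scores + alpha * conv3x3(A_prev, kernel))

rng = np.random.default_rng(4)
T = 6
A0 = softmax(rng.normal(size=(T, T)))  # layer-0 attention map
A1 = evolved_attention(rng.normal(size=(T, T)), A0, rng.normal(size=(3, 3)))
print(A1.sum(axis=-1))  # rows still sum to 1
```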
arXiv Detail & Related papers (2021-02-20T15:24:06Z)
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision.
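A hedged sketch of the collaborative idea: instead of each head owning its own query/key projections, all heads share one pair of projections, and each head re-weights the shared dimensions with a learned mixing vector. The exact parameterization below is an assumption based on the abstract.

```python
# Collaborative multi-head attention: shared Q/K projections + per-head mixing.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def collaborative_heads(X, Wq, Wk, Wv_list, mix):
    Q, K = X @ Wq, X @ Wk              # shared query/key projections
    outs = []
    for m, Wv in zip(mix, Wv_list):    # one mixing vector + value proj per head
        A = softmax((Q * m) @ K.T / np.sqrt(K.shape[1]))
        outs.append(A @ (X @ Wv))
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(2)
T, d, dk, heads = 6, 16, 8, 4
X = rng.normal(size=(T, d))
out = collaborative_heads(
    X, rng.normal(size=(d, dk)), rng.normal(size=(d, dk)),
    [rng.normal(size=(d, dk)) for _ in range(heads)],
    rng.normal(size=(heads, dk)),
)
print(out.shape)  # (6, 32): concatenated outputs of 4 collaborating heads
```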
arXiv Detail & Related papers (2020-06-29T20:28:52Z)