What's in your Head? Emergent Behaviour in Multi-Task Transformer Models
- URL: http://arxiv.org/abs/2104.06129v1
- Date: Tue, 13 Apr 2021 12:04:30 GMT
- Title: What's in your Head? Emergent Behaviour in Multi-Task Transformer Models
- Authors: Mor Geva, Uri Katz, Aviv Ben-Arie, Jonathan Berant
- Abstract summary: We study the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for.
We find that non-target heads exhibit emergent behaviour, which may either explain the target task, or generalize beyond their original task.
- Score: 26.557793822750302
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary paradigm for multi-task training in natural language processing
is to represent the input with a shared pre-trained language model, and add a
small, thin network (head) per task. Given an input, a target head is the head
that is selected for outputting the final prediction. In this work, we examine
the behaviour of non-target heads, that is, the output of heads when given
input that belongs to a different task than the one they were trained for. We
find that non-target heads exhibit emergent behaviour, which may either explain
the target task, or generalize beyond their original task. For example, in a
numerical reasoning task, a span extraction head extracts from the input the
arguments to a computation that results in a number generated by a target
generative head. In addition, a summarization head that is trained with a
target question answering head, outputs query-based summaries when given a
question and a context from which the answer is to be extracted. This emergent
behaviour suggests that multi-task training leads to non-trivial extrapolation
of skills, which can be harnessed for interpretability and generalization.
Related papers
- When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers [64.1656365676171]
Task arithmetic refers to editing the pre-trained model by adding a weighted sum of task vectors.
This paper theoretically proves the effectiveness of task addition for simultaneously learning a set of irrelevant or aligned tasks.
We also prove the proper selection of coefficients for task arithmetic to achieve negation on out-of-domain tasks (a schematic sketch of task arithmetic appears after this list).
arXiv Detail & Related papers (2025-04-15T08:04:39Z) - Do Attention Heads Compete or Cooperate during Counting? [0.12116854758481393]
We present an in-depth mechanistic interpretability analysis of training small transformers on an elementary task, counting.
We ask whether the attention heads behave as a pseudo-ensemble, all solving the same subtask, or whether they perform different subtasks, meaning they can solve the original task only in conjunction.
arXiv Detail & Related papers (2025-02-10T17:21:39Z) - Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning.
We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads.
We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - Identifying Selections for Unsupervised Subtask Discovery [12.22188797558089]
We provide a theory to identify, and experiments to verify the existence of selection variables in data.
These selections serve as subgoals that indicate subtasks and guide policy.
In light of this idea, we develop a sequential non-negative matrix factorization (seq-NMF) method to learn these subgoals and extract meaningful behavior patterns as subtasks.
arXiv Detail & Related papers (2024-10-28T23:47:43Z) - Gradient-based inference of abstract task representations for generalization in neural networks [5.794537047184604]
We show that gradients backpropagated through a neural network to a task representation layer are an efficient way to infer current task demands.
We demonstrate that gradient-based inference provides higher learning efficiency and better generalization to novel tasks (a toy sketch of this inference loop appears after this list).
arXiv Detail & Related papers (2024-07-24T15:28:08Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency.
This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is key for harder, reasoning-based tasks (a toy example of this measure appears after this list).
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z) - Multi-task Bias-Variance Trade-off Through Functional Constraints [102.64082402388192]
Multi-task learning aims to acquire a set of functions that perform well for diverse tasks.
In this paper we draw intuition from the two extreme learning scenarios -- a single function for all tasks, and a task-specific function that ignores the other tasks.
We introduce a constrained learning formulation that enforces domain-specific solutions to remain close to a central function.
arXiv Detail & Related papers (2022-10-27T16:06:47Z) - Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z) - Coarse-to-Fine: Hierarchical Multi-task Learning for Natural Language Understanding [51.31622274823167]
We propose a hierarchical framework with a coarse-to-fine paradigm: the bottom level is shared across all tasks, the mid-level is divided into different groups, and the top level is assigned to each individual task.
This allows our model to learn basic language properties from all tasks, boost performance on relevant tasks, and reduce the negative impact from irrelevant tasks.
arXiv Detail & Related papers (2022-08-19T02:46:20Z) - Pretext Tasks selection for multitask self-supervised speech representation learning [23.39079406674442]
This paper introduces a method to select a group of pretext tasks among a set of candidates.
Experiments conducted on speaker recognition and automatic speech recognition validate our approach.
arXiv Detail & Related papers (2021-07-01T16:36:29Z) - Representation Learning Beyond Linear Prediction Functions [33.94130046391917]
We show that diversity can be achieved when source tasks and the target task use different prediction function spaces beyond linear functions.
For a general function class, we find that eluder dimension gives a lower bound on the number of tasks required for diversity.
arXiv Detail & Related papers (2021-05-31T14:21:52Z) - Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification & regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
arXiv Detail & Related papers (2021-05-08T22:26:52Z) - Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance? [27.64235687067883]
We show that models can learn to encode linguistic properties even if they are not needed for the task on which the model was trained.
We demonstrate that models can encode these properties considerably above chance level even when they are distributed in the data as random noise (a minimal probing sketch appears right after this list).
arXiv Detail & Related papers (2020-05-02T06:19:20Z)
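A minimal probing-classifier sketch for the entry directly above ("Probing the Probing Paradigm"), assuming scikit-learn. The random features are placeholders for frozen model activations; with real trained-model representations, the paper reports above-chance probe accuracy even for noise-distributed properties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
reps = rng.normal(size=(1000, 64))       # placeholder for frozen model activations
labels = rng.integers(0, 2, size=1000)   # linguistic property being probed

# Fit a small probe on the representations and compare against chance.
X_tr, X_te, y_tr, y_te = train_test_split(reps, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te), "| chance level = 0.5")
```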
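The sketch referenced in the task-vector entry above ("When is Task Vector Provably Effective for Model Editing?"): a task vector is the difference between fine-tuned and pre-trained weights, and editing adds a weighted sum of such vectors, with negative coefficients implementing negation. Function names here are illustrative, not the paper's code.

```python
from typing import Dict, List
import torch

StateDict = Dict[str, torch.Tensor]

def task_vector(pretrained: StateDict, finetuned: StateDict) -> StateDict:
    # tau_i = theta_finetuned - theta_pretrained, per parameter tensor
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_arithmetic(pretrained: StateDict,
                          vectors: List[StateDict],
                          coeffs: List[float]) -> StateDict:
    # theta_edited = theta_pretrained + sum_i lambda_i * tau_i
    edited = dict(pretrained)
    for tau, lam in zip(vectors, coeffs):
        # lam > 0 adds a task's skill; lam < 0 implements task negation
        edited = {k: edited[k] + lam * tau[k] for k in edited}
    return edited
```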
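The toy loop referenced in the gradient-based inference entry above: weights stay frozen, and only a task-representation vector is updated by backpropagating the loss on the current inputs. The network, shapes, and concatenation scheme are assumptions for illustration.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(10 + 4, 32), nn.ReLU(), nn.Linear(32, 2))
for p in net.parameters():
    p.requires_grad_(False)          # network weights stay frozen at inference time

task_repr = torch.zeros(4, requires_grad=True)   # the inferred task representation
opt = torch.optim.SGD([task_repr], lr=0.1)

def infer_task(x, y, steps=20):
    for _ in range(steps):
        opt.zero_grad()
        # Concatenate the current task representation to every input.
        inp = torch.cat([x, task_repr.expand(x.size(0), -1)], dim=-1)
        loss = nn.functional.cross_entropy(net(inp), y)
        loss.backward()              # gradients flow only into task_repr
        opt.step()
    return task_repr.detach()

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
print(infer_task(x, y))
```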
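A toy rendering of the "distributional memorization" measure from the generalization-vs-memorization entry above: correlate the model's output probabilities with how often the corresponding strings occur in the pretraining corpus. The use of Spearman correlation is an assumption here, and the counts and probabilities are made-up placeholders.

```python
from scipy.stats import spearmanr

pretrain_counts = [120_000, 35_000, 900, 12]   # corpus frequency of each answer string
model_probs = [0.61, 0.30, 0.07, 0.02]         # model's output probability per answer

# High rank correlation suggests the model is tracking pretraining frequency,
# i.e. memorization rather than generalization.
rho, p = spearmanr(pretrain_counts, model_probs)
print(f"distributional memorization (Spearman rho) = {rho:.2f}")
```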
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.