The Hydra Effect: Emergent Self-repair in Language Model Computations
- URL: http://arxiv.org/abs/2307.15771v1
- Date: Fri, 28 Jul 2023 19:13:26 GMT
- Title: The Hydra Effect: Emergent Self-repair in Language Model Computations
- Authors: Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane
Legg
- Abstract summary: We investigate the internal structure of language model computations using causal analysis.
We demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (the Hydra effect), and (2) a counterbalancing function of late MLP layers that downregulate the maximum-likelihood token.
We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
- Score: 8.323441767835257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the internal structure of language model computations using
causal analysis and demonstrate two motifs: (1) a form of adaptive computation
where ablations of one attention layer of a language model cause another layer
to compensate (which we term the Hydra effect) and (2) a counterbalancing
function of late MLP layers that act to downregulate the maximum-likelihood
token. Our ablation studies demonstrate that language model layers are
typically relatively loosely coupled (ablations to one layer only affect a
small number of downstream layers). Surprisingly, these effects occur even in
language models trained without any form of dropout. We analyse these effects
in the context of factual recall and consider their implications for
circuit-level attribution in language models.
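The ablation methodology described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is not the authors' code; the model, prompt, layer index, and target token are illustrative choices. It zero-ablates one attention layer of GPT-2 via a forward hook and compares the target token's logit before and after the intervention; under Hydra-effect-style self-repair, the logit drop would be smaller than the ablated layer's direct contribution alone would suggest.

```python
# Minimal sketch (assumed setup, not the paper's code): zero-ablate one attention
# layer in GPT-2 and compare the target token's logit before and after.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
target_id = tok(" Paris", return_tensors="pt").input_ids[0, 0]

ABLATE_LAYER = 5  # attention layer to knock out (illustrative choice)

def zero_attn_output(module, module_inputs, output):
    # GPT2Attention returns a tuple; replace its main output tensor with zeros,
    # which removes this layer's attention contribution from the residual stream.
    return (torch.zeros_like(output[0]),) + tuple(output[1:])

def target_logit():
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0, -1, target_id].item()

clean = target_logit()

handle = model.transformer.h[ABLATE_LAYER].attn.register_forward_hook(zero_attn_output)
ablated = target_logit()
handle.remove()

print(f"clean logit={clean:.3f}  ablated logit={ablated:.3f}  drop={clean - ablated:.3f}")
# If downstream layers compensate (the Hydra effect), the drop is smaller than
# the ablated layer's own contribution to the logit would suggest.
```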
Related papers
- Learning thresholds lead to stable language coexistence [0.0]
We introduce a language competition model that incorporates the effects of memory and learning on the language shift dynamics.
On a coarse-grained time scale, the effects of memory and learning can be expressed as thresholds on the speakers' fractions.
arXiv Detail & Related papers (2024-06-14T14:24:02Z) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models [32.2976613483151]
We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task.
We find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers.
arXiv Detail & Related papers (2024-06-13T18:12:01Z) - Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z) - Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs [0.873811641236639]
We introduce a novel 'decoding probing' method to probe internal linguistic characteristics in neural language models layer by layer.
By treating the language model as the 'brain' and its representations as 'neural activations', we decode grammaticality labels of minimal pairs from the intermediate layers' representations (a minimal probe sketch follows this list).
arXiv Detail & Related papers (2024-03-26T00:56:06Z) - CausalGym: Benchmarking causal interpretability methods on linguistic
tasks [52.61917615039112]
We use CausalGym to benchmark the ability of interpretability methods to causally affect model behaviour.
We study the Pythia models (14M-6.9B) and assess the causal efficacy of a wide range of interpretability methods.
We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena.
arXiv Detail & Related papers (2024-02-19T21:35:56Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Modeling Target-Side Morphology in Neural Machine Translation: A
Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - Examining Scaling and Transfer of Language Model Architectures for
Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Linguistically inspired morphological inflection with a sequence to
sequence model [19.892441884896893]
Our research question is whether a neural network would be capable of learning inflectional morphemes for inflection production.
We are using an inflectional corpus and a single layer seq2seq model to test this hypothesis.
Our character-morpheme-based model creates inflection by predicting the stem character-to-character and the inflectional affixes as character blocks.
arXiv Detail & Related papers (2020-09-04T08:58:42Z) - CausaLM: Causal Model Explanation Through Counterfactual Language Models [33.29636213961804]
CausaLM is a framework for producing causal model explanations using counterfactual language representation models.
We show that language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest.
A byproduct of our method is a language representation model that is unaffected by the tested concept.
arXiv Detail & Related papers (2020-05-27T15:06:35Z)
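The 'decoding probing' entry above describes reading grammaticality judgements out of intermediate-layer representations. The following sketch is an assumed setup, not that paper's code: it fits a simple logistic-regression probe per GPT-2 layer on a toy set of minimal pairs; the sentences, model, and last-token featurisation are illustrative choices.

```python
# Minimal probing sketch: one logistic-regression probe per layer, trained to
# separate grammatical from ungrammatical sentences using GPT-2 hidden states.
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

# Toy minimal pairs (label 1 = grammatical, 0 = ungrammatical); a real study
# would use a full minimal-pair benchmark rather than a handful of sentences.
pairs = [
    ("The keys to the cabinet are on the table.", 1),
    ("The keys to the cabinet is on the table.", 0),
    ("The authors the critic praises write well.", 1),
    ("The authors the critic praises writes well.", 0),
]

def layer_features(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states  # embeddings + one tensor per layer
    # Use the last-token representation at each layer as the probe input.
    return [h[0, -1].numpy() for h in hidden]

feats = [layer_features(s) for s, _ in pairs]
labels = [y for _, y in pairs]

for layer in range(len(feats[0])):
    X = [f[layer] for f in feats]
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    # Training accuracy on toy data is purely illustrative of the procedure.
    print(f"layer {layer:2d}: train accuracy {probe.score(X, labels):.2f}")
```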
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.