The Hydra Effect: Emergent Self-repair in Language Model Computations
- URL: http://arxiv.org/abs/2307.15771v1
- Date: Fri, 28 Jul 2023 19:13:26 GMT
- Title: The Hydra Effect: Emergent Self-repair in Language Model Computations
- Authors: Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, Shane
Legg
- Abstract summary: We investigate the internal structure of language model computations using causal analysis.
We demonstrate two motifs: (1) a form of adaptive computation where ablations of one attention layer of a language model cause another layer to compensate (the Hydra effect), and (2) a counterbalancing function of late MLP layers that downregulate the maximum-likelihood token.
We analyse these effects in the context of factual recall and consider their implications for circuit-level attribution in language models.
- Score: 8.323441767835257
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the internal structure of language model computations using
causal analysis and demonstrate two motifs: (1) a form of adaptive computation
where ablations of one attention layer of a language model cause another layer
to compensate (which we term the Hydra effect) and (2) a counterbalancing
function of late MLP layers that act to downregulate the maximum-likelihood
token. Our ablation studies demonstrate that language model layers are
typically relatively loosely coupled (ablations to one layer only affect a
small number of downstream layers). Surprisingly, these effects occur even in
language models trained without any form of dropout. We analyse these effects
in the context of factual recall and consider their implications for
circuit-level attribution in language models.
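The ablation methodology described in the abstract can be illustrated with a short, self-contained sketch. The snippet below is not the authors' code; the model, prompt, layer index, and target token are illustrative choices. It zero-ablates one attention layer of GPT-2 via a forward hook and compares the target token's logit before and after the intervention; under Hydra-effect-style self-repair, the logit drop would be smaller than the ablated layer's direct contribution alone would suggest.

```python
# Minimal sketch (assumed setup, not the paper's code): zero-ablate one attention
# layer in GPT-2 and compare the target token's logit before and after.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tok(prompt, return_tensors="pt")
target_id = tok(" Paris", return_tensors="pt").input_ids[0, 0]

ABLATE_LAYER = 5  # attention layer to knock out (illustrative choice)

def zero_attn_output(module, module_inputs, output):
    # GPT2Attention returns a tuple; replace its main output tensor with zeros,
    # which removes this layer's attention contribution from the residual stream.
    return (torch.zeros_like(output[0]),) + tuple(output[1:])

def target_logit():
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits[0, -1, target_id].item()

clean = target_logit()

handle = model.transformer.h[ABLATE_LAYER].attn.register_forward_hook(zero_attn_output)
ablated = target_logit()
handle.remove()

print(f"clean logit={clean:.3f}  ablated logit={ablated:.3f}  drop={clean - ablated:.3f}")
# If downstream layers compensate (the Hydra effect), the drop is smaller than
# the ablated layer's own contribution to the logit would suggest.
```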
Related papers
- Learning thresholds lead to stable language coexistence [0.0]
We introduce a language competition model that incorporates the effects of memory and learning on the language shift dynamics.
On a coarse-grained time scale, the effects of memory and learning can be expressed as thresholds on the speakers' fractions.
arXiv Detail & Related papers (2024-06-14T14:24:02Z) - Talking Heads: Understanding Inter-layer Communication in Transformer Language Models [32.2976613483151]
We analyze a mechanism used in two LMs to selectively inhibit items in a context in one task.
We find that models write into low-rank subspaces of the residual stream to represent features which are then read out by later layers.
arXiv Detail & Related papers (2024-06-13T18:12:01Z) - Holmes: A Benchmark to Assess the Linguistic Competence of Language Models [59.627729608055006]
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs).
We use computation-based probing to examine LMs' internal representations regarding distinct linguistic phenomena.
As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities.
arXiv Detail & Related papers (2024-04-29T17:58:36Z) - Decoding Probing: Revealing Internal Linguistic Structures in Neural Language Models using Minimal Pairs [0.873811641236639]
We introduce a novel 'decoding probing' method to probe internal linguistic characteristics in neural language models layer by layer.
By treating the language model as the 'brain' and its representations as 'neural activations', we decode grammaticality labels of minimal pairs from the intermediate layers' representations (a minimal probe sketch follows this list).
arXiv Detail & Related papers (2024-03-26T00:56:06Z) - CausalGym: Benchmarking causal interpretability methods on linguistic
tasks [52.61917615039112]
We use CausalGym to benchmark the ability of interpretability methods to causally affect model behaviour.
We study the Pythia models (14M-6.9B) and assess the causal efficacy of a wide range of interpretability methods.
We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena.
arXiv Detail & Related papers (2024-02-19T21:35:56Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Modeling Target-Side Morphology in Neural Machine Translation: A
Comparison of Strategies [72.56158036639707]
Morphologically rich languages pose difficulties to machine translation.
A large amount of differently inflected word surface forms entails a larger vocabulary.
Some inflected forms of infrequent terms typically do not appear in the training corpus.
Linguistic agreement requires the system to correctly match the grammatical categories between inflected word forms in the output sentence.
arXiv Detail & Related papers (2022-03-25T10:13:20Z) - Examining Scaling and Transfer of Language Model Architectures for
Machine Translation [51.69212730675345]
Language models (LMs) process sequences in a single stack of layers, and encoder-decoder models (EncDec) utilize separate layer stacks for input and output processing.
In machine translation, EncDec has long been the favoured approach, but few studies have investigated the performance of LMs.
arXiv Detail & Related papers (2022-02-01T16:20:15Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for
Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z) - Linguistically inspired morphological inflection with a sequence to
sequence model [19.892441884896893]
Our research question is whether a neural network would be capable of learning inflectional morphemes for inflection production.
We are using an inflectional corpus and a single layer seq2seq model to test this hypothesis.
Our character-morpheme-based model creates inflection by predicting the stem character-to-character and the inflectional affixes as character blocks.
arXiv Detail & Related papers (2020-09-04T08:58:42Z) - CausaLM: Causal Model Explanation Through Counterfactual Language Models [33.29636213961804]
CausaLM is a framework for producing causal model explanations using counterfactual language representation models.
We show that language representation models such as BERT can effectively learn a counterfactual representation for a given concept of interest.
A byproduct of our method is a language representation model that is unaffected by the tested concept.
arXiv Detail & Related papers (2020-05-27T15:06:35Z)
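The 'decoding probing' entry above describes reading grammaticality judgements out of intermediate-layer representations. The following sketch is an assumed setup, not that paper's code: it fits a simple logistic-regression probe per GPT-2 layer on a toy set of minimal pairs; the sentences, model, and last-token featurisation are illustrative choices.

```python
# Minimal probing sketch: one logistic-regression probe per layer, trained to
# separate grammatical from ungrammatical sentences using GPT-2 hidden states.
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

# Toy minimal pairs (label 1 = grammatical, 0 = ungrammatical); a real study
# would use a full minimal-pair benchmark rather than a handful of sentences.
pairs = [
    ("The keys to the cabinet are on the table.", 1),
    ("The keys to the cabinet is on the table.", 0),
    ("The authors the critic praises write well.", 1),
    ("The authors the critic praises writes well.", 0),
]

def layer_features(text):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states  # embeddings + one tensor per layer
    # Use the last-token representation at each layer as the probe input.
    return [h[0, -1].numpy() for h in hidden]

feats = [layer_features(s) for s, _ in pairs]
labels = [y for _, y in pairs]

for layer in range(len(feats[0])):
    X = [f[layer] for f in feats]
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    # Training accuracy on toy data is purely illustrative of the procedure.
    print(f"layer {layer:2d}: train accuracy {probe.score(X, labels):.2f}")
```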
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.