Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small
- URL: http://arxiv.org/abs/2211.00593v1
- Date: Tue, 1 Nov 2022 17:08:44 GMT
- Title: Interpretability in the Wild: a Circuit for Indirect Object
Identification in GPT-2 small
- Authors: Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris and
Jacob Steinhardt
- Abstract summary: We present an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI).
To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model.
- Score: 68.879023473838
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in mechanistic interpretability seeks to explain behaviors of
machine learning models in terms of their internal components. However, most
previous work either focuses on simple behaviors in small models, or describes
complicated behaviors in larger models with broad strokes. In this work, we
bridge this gap by presenting an explanation for how GPT-2 small performs a
natural language task called indirect object identification (IOI). Our
explanation encompasses 26 attention heads grouped into 7 main classes, which
we discovered using a combination of interpretability approaches relying on
causal interventions. To our knowledge, this investigation is the largest
end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a
language model. We evaluate the reliability of our explanation using three
quantitative criteria--faithfulness, completeness and minimality. Though these
criteria support our explanation, they also point to remaining gaps in our
understanding. Our work provides evidence that a mechanistic understanding of
large ML models is feasible, opening opportunities to scale our understanding
to both larger models and more complex tasks.
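For readers unfamiliar with the IOI setup, the sketch below is a minimal illustration (not the authors' code; the prompt, names, and exact metric are assumptions chosen for clarity): it loads the public GPT-2 small weights through the Hugging Face transformers library and computes the logit difference between the indirect-object name and the repeated subject name, the kind of behavioral score that circuit analyses of IOI are built around.

```python
# Minimal sketch of the IOI task and a logit-difference score for GPT-2 small.
# Illustrative only: the prompt and names are assumptions, not the paper's dataset.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# The model should complete this with the indirect object ("Mary"),
# not the repeated subject ("John").
prompt = "When Mary and John went to the store, John gave a drink to"
io_token = tokenizer.encode(" Mary")[0]  # first token of the indirect-object name
s_token = tokenizer.encode(" John")[0]   # first token of the subject name

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits

# How strongly the next-token prediction prefers the IO name over the subject;
# a positive value means the model resolves the indirect object correctly.
last = logits[0, -1]
print(f"logit(IO) - logit(S) = {(last[io_token] - last[s_token]).item():.3f}")
```

Causal-intervention methods of the kind the abstract refers to then ask how such a score changes when the activations of individual attention heads are replaced with activations from a corrupted prompt, which is what allows the heads to be grouped into functional classes.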
Related papers
- Sufficient and Necessary Explanations (and What Lies in Between) [6.9035001722324685]
We study two precise notions of feature importance for general machine learning models: sufficiency and necessity.
We propose a unified notion of importance that circumvents the limitations of each notion by exploring a continuum along a necessity-sufficiency axis.
arXiv Detail & Related papers (2024-09-30T15:50:57Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by simply recalling answers memorized from the pretraining corpus or via a genuine multi-step reasoning mechanism.
We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Language Models Implement Simple Word2Vec-style Vector Arithmetic [32.2976613483151]
A primary criticism of language models (LMs) is their inscrutability.
This paper presents evidence that, despite their size and complexity, LMs sometimes exploit a simple word2vec-style vector-arithmetic mechanism to solve some relational tasks (a toy illustration of this kind of arithmetic appears after this list).
arXiv Detail & Related papers (2023-05-25T15:04:01Z)
- Interpretability at Scale: Identifying Causal Mechanisms in Alpaca [62.65877150123775]
We use Boundless DAS to efficiently search for interpretable causal structure in large language models while they follow instructions.
Our findings mark a first step toward faithfully understanding the inner workings of our ever-growing and most widely deployed language models.
arXiv Detail & Related papers (2023-05-15T17:15:40Z)
- Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z)
- ExSum: From Local Explanations to Model Understanding [6.23934576145261]
Interpretability methods are developed to understand the working mechanisms of black-box models.
Fulfilling this goal requires both that the explanations generated by these methods are correct and that people can easily and reliably understand them.
We introduce explanation summary (ExSum), a mathematical framework for quantifying model understanding.
arXiv Detail & Related papers (2022-04-30T02:07:20Z)
- Tell me why! -- Explanations support learning of relational and causal structure [24.434551113103105]
Explanations play a considerable role in human learning, especially in areas that remain major challenges for AI.
We show that reinforcement learning agents might likewise benefit from explanations.
Our results suggest that learning from explanations is a powerful principle that could offer a promising path towards training more robust and general machine learning systems.
arXiv Detail & Related papers (2021-12-07T15:09:06Z)
- The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
arXiv Detail & Related papers (2020-09-23T09:45:23Z)
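As referenced in the Word2Vec-style vector arithmetic entry above, that line of work attributes some relational behavior to analogy-style arithmetic on internal vectors. The toy sketch below only illustrates what such arithmetic means, using hand-written vectors; it does not reproduce that paper's experiments or its learned representations.

```python
# Toy word2vec-style analogy arithmetic on made-up embedding vectors.
# The vectors are invented for illustration; a real study would use
# representations extracted from a trained language model.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

emb = {
    "Paris":  normalize(np.array([0.9, 0.1, 0.8, 0.0])),
    "France": normalize(np.array([0.9, 0.1, 0.1, 0.9])),
    "Warsaw": normalize(np.array([0.1, 0.9, 0.8, 0.0])),
    "Poland": normalize(np.array([0.1, 0.9, 0.1, 0.9])),
}

# "Paris is to France as Warsaw is to ?": add the capital-to-country offset.
query = normalize(emb["France"] - emb["Paris"] + emb["Warsaw"])
best = max(emb, key=lambda w: float(np.dot(query, emb[w])))
print(best)  # "Poland" with these toy vectors
```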
This list is automatically generated from the titles and abstracts of the papers on this site.