InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- URL: http://arxiv.org/abs/2407.14494v2
- Date: Wed, 30 Oct 2024 13:44:04 GMT
- Title: InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
- Authors: Rohan Gupta, Iván Arcuschin, Thomas Kwa, Adrià Garriga-Alonso,
- Abstract summary: We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT)
We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic.
- Score: 1.075853016351821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train simple neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
Related papers
- Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation.
We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers as well as the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z) - Transformers meet Neural Algorithmic Reasoners [16.5785372289558]
We propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs)
We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning.
arXiv Detail & Related papers (2024-06-13T16:42:06Z) - Mechanistic Interpretability of Binary and Ternary Transformers [1.3715396507106912]
We investigate whether binary and ternary transformer networks learn distinctly different or similar algorithms when compared to full-precision transformer networks.
This provides evidence against the possibility of using binary and ternary networks as a more interpretable alternative in the Large Language Models setting.
arXiv Detail & Related papers (2024-05-27T23:22:23Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - Transformers as Statisticians: Provable In-Context Learning with
In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A emphsingle transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - Transformers as Algorithms: Generalization and Implicit Model Selection
in In-context Learning [23.677503557659705]
In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of examples and performs inference on-the-fly.
We treat the transformer model as a learning algorithm that can be specialized via training to implement-at inference-time-another target algorithm.
We show that transformers can act as an adaptive learning algorithm and perform model selection across different hypothesis classes.
arXiv Detail & Related papers (2023-01-17T18:31:12Z) - Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling
and Design [68.1682448368636]
We present a supervised pretraining approach to learn circuit representations that can be adapted to new unseen topologies or unseen prediction tasks.
To cope with the variable topological structure of different circuits we describe each circuit as a graph and use graph neural networks (GNNs) to learn node embeddings.
We show that pretraining GNNs on prediction of output node voltages can encourage learning representations that can be adapted to new unseen topologies or prediction of new circuit level properties.
arXiv Detail & Related papers (2022-03-29T21:18:47Z) - Neural Network Training with Asymmetric Crosspoint Elements [1.0773924713784704]
asymmetric conductance modulation of practical resistive devices critically degrades the classification of networks trained with conventional algorithms.
Here, we describe and experimentally demonstrate an alternative fully-parallel training algorithm: Hamiltonian Descent.
We provide critical intuition on why device asymmetry is fundamentally incompatible with conventional training algorithms and how the new approach exploits it as a useful feature instead.
arXiv Detail & Related papers (2022-01-31T17:41:36Z) - Data-Driven Symbol Detection via Model-Based Machine Learning [117.58188185409904]
We review a data-driven framework to symbol detection design which combines machine learning (ML) and model-based algorithms.
In this hybrid approach, well-known channel-model-based algorithms are augmented with ML-based algorithms to remove their channel-model-dependence.
Our results demonstrate that these techniques can yield near-optimal performance of model-based algorithms without knowing the exact channel input-output statistical relationship.
arXiv Detail & Related papers (2020-02-14T06:58:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.