Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
- URL: http://arxiv.org/abs/2410.01434v1
- Date: Wed, 2 Oct 2024 11:36:45 GMT
- Title: Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
- Authors: Philipp Mondorf, Sondre Wold, Barbara Plank
- Abstract summary: We identify and compare circuits responsible for ten modular string-edit operations within a language model.
Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness.
- Score: 22.89563355840371
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions via subnetworks that can be composed to perform more complex tasks. Recent developments in mechanistic interpretability have made progress in identifying subnetworks, often referred to as circuits, which represent the minimal computational subgraph responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits relate to each other. To address this gap, we examine the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through subnetwork set operations to represent more complex functional capabilities of the model.
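The two quantitative notions in this abstract, node overlap and subnetwork set operations, are easy to illustrate. The sketch below is hypothetical Python: the `Circuit` class, node labels, and the intersection-over-union score are illustrative assumptions, not the authors' implementation. It treats a circuit as the set of attention heads and MLPs it retains and combines circuits with ordinary set operations.
```python
# Hypothetical sketch: circuits as sets of model components.
# Node names and the IoU overlap measure are illustrative assumptions.

class Circuit:
    """A circuit as the set of nodes (attention heads / MLPs) it retains."""

    def __init__(self, nodes):
        self.nodes = frozenset(nodes)

    def union(self, other):
        # Set union: a candidate subnetwork covering both capabilities.
        return Circuit(self.nodes | other.nodes)

    def intersection(self, other):
        # Set intersection: components shared by both circuits.
        return Circuit(self.nodes & other.nodes)

    def overlap(self, other):
        # Intersection-over-union as a simple node-overlap score.
        joint = self.nodes | other.nodes
        return len(self.nodes & other.nodes) / len(joint) if joint else 0.0


# Two toy circuits for related string-edit operations.
copy_op = Circuit({"L0.H3", "L1.H7", "L2.MLP"})
reverse_op = Circuit({"L0.H3", "L1.H5", "L2.MLP"})

print(copy_op.overlap(reverse_op))    # 0.5
combined = copy_op.union(reverse_op)  # candidate circuit for a composed task
print(sorted(combined.nodes))
```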
Related papers
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
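As a rough illustration of what "causally implicated" features could look like in code, the hypothetical sketch below scores features by a first-order attribution (activation times gradient of a task metric) and keeps only those above a sparsity threshold; the tensors, the metric, and the threshold are toy stand-ins, not the paper's actual pipeline.
```python
import torch

def sparse_feature_circuit(feature_acts: torch.Tensor, metric: torch.Tensor,
                           threshold: float = 0.1):
    """feature_acts: (n_features,) activations with requires_grad=True.
    metric: scalar task metric computed from those activations."""
    (grads,) = torch.autograd.grad(metric, feature_acts)
    # Activation * gradient approximates the first-order effect of
    # ablating each feature on the metric.
    scores = (feature_acts * grads).abs()
    keep = torch.nonzero(scores > threshold).flatten()
    return keep, scores


feats = torch.randn(8, requires_grad=True)
metric = (feats[:4] ** 2).sum()  # toy metric depending only on features 0-3
circuit, scores = sparse_feature_circuit(feats, metric)
print(circuit.tolist())
```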
arXiv Detail & Related papers (2024-03-28T17:56:07Z)
- Uncovering Intermediate Variables in Transformers using Circuit Probing [32.382094867951224]
We propose a new analysis technique -- circuit probing -- that automatically uncovers low-level circuits that compute hypothesized intermediate variables.
We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training.
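The approach is easiest to picture as mask learning. Below is a deliberately simplified sketch, with the caveats that the mask here is applied to activations rather than model weights, and the probe, data, and regularizer are toy stand-ins for the method's actual objective: a soft, sparsity-regularized mask is trained so that the masked representation still predicts a hypothesized intermediate variable, and the surviving units are read off as the candidate circuit.
```python
import torch
import torch.nn as nn

hidden = torch.randn(256, 64)       # stand-in hidden states (batch, d)
target = (hidden[:, 0] > 0).long()  # hypothesized intermediate variable

mask_logits = nn.Parameter(torch.zeros(64))
probe = nn.Linear(64, 2)
opt = torch.optim.Adam([mask_logits, *probe.parameters()], lr=1e-2)

for step in range(200):
    mask = torch.sigmoid(mask_logits)          # soft mask in (0, 1)
    logits = probe(hidden * mask)              # probe the masked representation
    # Cross-entropy keeps the variable decodable; the L1 term pushes the
    # mask toward a minimal set of units.
    loss = nn.functional.cross_entropy(logits, target) + 1e-2 * mask.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Units with mask near 1 form the candidate circuit for the variable.
print((torch.sigmoid(mask_logits) > 0.5).nonzero().flatten().tolist())
```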
arXiv Detail & Related papers (2023-11-07T21:27:17Z)
- Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models [9.56229382432426]
This research aims to reverse engineer transformer models into human-readable representations that implement algorithmic functions.
By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B.
We show that this sub-circuit affects various math-related prompts, including intervaled sequences, Spanish number-word and month continuation, and natural language word problems.
arXiv Detail & Related papers (2023-11-07T16:58:51Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform in-context learning (ICL).
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Can Transformers Learn to Solve Problems Recursively? [9.5623664764386]
This paper examines the behavior of neural networks that learn algorithms relevant to programs and formal verification.
By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions.
arXiv Detail & Related papers (2023-05-24T04:08:37Z)
- On Neural Architecture Inductive Biases for Relational Tasks [76.18938462270503]
We introduce a simple architecture based on similarity-distribution scores, which we name Compositional Relational Network (CoRelNet).
We find that simple architectural choices can outperform existing models in out-of-distribution generalization.
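A minimal sketch of the similarity-distribution idea, under simplifying assumptions (the readout and dimensions below are illustrative, not CoRelNet's exact architecture): pairwise similarities between object embeddings are normalized into per-row distributions, and a small decoder makes the relational decision from that matrix alone.
```python
import torch
import torch.nn as nn

def similarity_distribution(objects: torch.Tensor) -> torch.Tensor:
    """objects: (n, d) embeddings -> (n, n) row-normalized similarity matrix."""
    sims = objects @ objects.T          # dot-product similarities
    return torch.softmax(sims, dim=-1)  # each row becomes a distribution

n, d = 4, 16
objects = torch.randn(n, d)
rel = similarity_distribution(objects)  # (4, 4) relational summary

# Toy relational readout, e.g. a same/different decision.
decoder = nn.Linear(n * n, 2)
print(decoder(rel.flatten()).shape)     # torch.Size([2])
```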
arXiv Detail & Related papers (2022-06-09T16:24:01Z)
- Redefining Neural Architecture Search of Heterogeneous Multi-Network Models by Characterizing Variation Operators and Model Components [71.03032589756434]
We investigate the effect of different variation operators in a complex domain, that of multi-network heterogeneous neural models.
We characterize both the variation operators, according to their effect on the complexity and performance of the model, and the models themselves, relying on diverse metrics that estimate the quality of their component parts.
arXiv Detail & Related papers (2021-06-16T17:12:26Z)
- A Compositional Atlas of Tractable Circuit Operations: From Simple Transformations to Complex Information-Theoretic Queries [44.36335714431731]
We show how complex inference scenarios for machine learning can be represented in terms of tractable modular operations over circuits.
We derive a unified framework for reasoning about tractable models that generalizes several results in the literature and opens up novel tractable inference scenarios.
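A toy example of what "tractable modular operations over circuits" means in practice, assuming a minimal sum-product circuit over binary variables (the classes below are illustrative, not the paper's framework): marginalizing a variable is as simple as letting its leaves evaluate to one.
```python
# Toy sum-product circuit over binary variables (illustrative assumption).

class Leaf:
    def __init__(self, var, val): self.var, self.val = var, val
    def eval(self, assignment):
        # A variable absent from the assignment is marginalized: leaf -> 1.
        if self.var not in assignment: return 1.0
        return 1.0 if assignment[self.var] == self.val else 0.0

class Sum:
    def __init__(self, children, weights): self.children, self.weights = children, weights
    def eval(self, assignment):
        return sum(w * c.eval(assignment) for w, c in zip(self.weights, self.children))

class Product:
    def __init__(self, children): self.children = children
    def eval(self, assignment):
        p = 1.0
        for c in self.children: p *= c.eval(assignment)
        return p

# p(X, Y) = 0.6 * [X=1][Y=1] + 0.4 * [X=0][Y=0]
circuit = Sum([Product([Leaf("X", 1), Leaf("Y", 1)]),
               Product([Leaf("X", 0), Leaf("Y", 0)])], [0.6, 0.4])

print(circuit.eval({"X": 1, "Y": 1}))  # 0.6: full evidence
print(circuit.eval({"X": 1}))          # 0.6: Y tractably marginalized out
```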
arXiv Detail & Related papers (2021-02-11T17:26:32Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
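The decomposition loop can be sketched in a few lines. In the hypothetical example below, `decompose`, `qa_model`, and `calculator` are stand-ins for the next-question generator, the neural single-span QA model, and the symbolic calculator; the fixed plan and canned answers exist only to make the sketch runnable.
```python
def calculator(expr: str) -> str:
    # Symbolic step: evaluate a simple arithmetic expression.
    return str(eval(expr, {"__builtins__": {}}))

def qa_model(question: str, context: str) -> str:
    # Stand-in for a neural factoid single-span QA model.
    return {"When was the bridge built?": "1937",
            "When was the tower built?": "1889"}.get(question, "")

def decompose(question, answers):
    # Stand-in next-question generator: a fixed plan for one example.
    plan = ["When was the bridge built?",
            "When was the tower built?",
            "calc: {0} - {1}"]
    return plan[len(answers)] if len(answers) < len(plan) else None

answers = []
while (step := decompose("How many years after the tower was the bridge built?",
                         answers)) is not None:
    if step.startswith("calc: "):
        answers.append(calculator(step[6:].format(*answers)))
    else:
        answers.append(qa_model(step, context=""))
print(answers[-1])  # -> 48
```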
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
- Obtaining Faithful Interpretations from Compositional Neural Networks [72.41100663462191]
We evaluate the intermediate outputs of neural module networks (NMNs) on the NLVR2 and DROP datasets.
We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour.
arXiv Detail & Related papers (2020-05-02T06:50:35Z)