Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
- URL: http://arxiv.org/abs/2410.01434v2
- Date: Mon, 17 Feb 2025 13:30:15 GMT
- Title: Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models
- Authors: Philipp Mondorf, Sondre Wold, Barbara Plank
- Abstract summary: We study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a language model.
Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness.
- Score: 22.89563355840371
- Abstract: A fundamental question in interpretability research is to what extent neural networks, particularly language models, implement reusable functions through subnetworks that can be composed to perform more complex tasks. Recent advances in mechanistic interpretability have made progress in identifying $\textit{circuits}$, which represent the minimal computational subgraphs responsible for a model's behavior on specific tasks. However, most studies focus on identifying circuits for individual tasks without investigating how functionally similar circuits $\textit{relate}$ to each other. To address this gap, we study the modularity of neural networks by analyzing circuits for highly compositional subtasks within a transformer-based language model. Specifically, given a probabilistic context-free grammar, we identify and compare circuits responsible for ten modular string-edit operations. Our results indicate that functionally similar circuits exhibit both notable node overlap and cross-task faithfulness. Moreover, we demonstrate that the circuits identified can be reused and combined through set operations to represent more complex functional model capabilities.
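To make the comparison and composition ideas concrete, the following is a minimal sketch (plain Python, not the authors' code) of how circuits, once represented as sets of model components, could be scored for node overlap and combined through set operations. The component names and circuit contents are purely hypothetical.

```python
# Sketch: circuits as sets of model components (attention heads "aL.hH" and MLPs "mL").
# The node sets below are hypothetical and only illustrate the set-based view.

def node_overlap(circuit_a: set[str], circuit_b: set[str]) -> float:
    """Intersection-over-union of two circuits' node sets."""
    union = circuit_a | circuit_b
    return len(circuit_a & circuit_b) / len(union) if union else 0.0

# Hypothetical circuits for two string-edit subtasks.
circuit_copy = {"a0.h3", "a1.h7", "m2", "a3.h0"}
circuit_swap = {"a0.h3", "a1.h7", "m2", "a4.h5", "m5"}

print(f"node overlap (IoU): {node_overlap(circuit_copy, circuit_swap):.2f}")

# Composing circuits via set operations: the union is a candidate subgraph for a
# task that needs both capabilities; the intersection is their shared core.
combined = circuit_copy | circuit_swap
shared = circuit_copy & circuit_swap
print(f"union size: {len(combined)}, shared core: {sorted(shared)}")
```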
Related papers
- Position-aware Automatic Circuit Discovery [59.64762573617173]
We identify a gap in existing circuit discovery methods, which treat model components as equally relevant across input positions.
We propose two improvements to incorporate positionality into circuits, even on tasks containing variable-length examples.
Our approach enables fully automated discovery of position-sensitive circuits, yielding better trade-offs between circuit size and faithfulness compared to prior work.
arXiv Detail & Related papers (2025-02-07T00:18:20Z)
- Transformers Use Causal World Models in Maze-Solving Tasks [49.67445252528868]
We investigate the inner workings of transformer models trained on tasks across various domains.
We find that transformers are able to reason with respect to a greater number of active features than they see during training.
We observe that varying positional encodings can alter how world models (WMs) are encoded in a model's residual stream.
arXiv Detail & Related papers (2024-12-16T15:21:04Z)
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Transformer Circuit Faithfulness Metrics are not Robust [0.04260910081285213]
We measure circuit 'faithfulness' by ablating portions of the model's computation (an illustrative sketch of such a score appears after this list).
We conclude that existing circuit faithfulness scores reflect both the methodological choices of researchers as well as the actual components of the circuit.
The ultimate goal of mechanistic interpretability work is to understand neural networks, so we emphasize the need for more clarity in the precise claims being made about circuits.
arXiv Detail & Related papers (2024-07-11T17:59:00Z)
- Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [55.19497659895122]
We introduce methods for discovering and applying sparse feature circuits.
These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors.
arXiv Detail & Related papers (2024-03-28T17:56:07Z)
- Uncovering Intermediate Variables in Transformers using Circuit Probing [28.81226181942661]
We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables.
We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training.
arXiv Detail & Related papers (2023-11-07T21:27:17Z)
- Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models [9.56229382432426]
This research aims to reverse engineer transformer models into human-readable representations that implement algorithmic functions.
By applying circuit interpretability analysis, we identify a key sub-circuit in both GPT-2 Small and Llama-2-7B.
We show that this sub-circuit has effects on various math-related prompts, such as intervaled circuits, Spanish number-word and month continuation, and natural language word problems.
arXiv Detail & Related papers (2023-11-07T16:58:51Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Can Transformers Learn to Solve Problems Recursively? [9.5623664764386]
This paper examines the behavior of neural networks that learn algorithms relevant to programs and formal verification.
By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions.
arXiv Detail & Related papers (2023-05-24T04:08:37Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
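As referenced in the faithfulness entry above, here is an illustrative sketch of a commonly used normalized faithfulness score, where a circuit is evaluated with everything outside it ablated. The interface and the numbers are hypothetical; this is not the metric implementation of any paper listed here.

```python
# Illustrative faithfulness score for a circuit. "Scores" are task metrics
# (e.g., logit difference or accuracy) obtained under hypothetical interventions:
#   full_score:    unmodified model
#   circuit_score: components outside the circuit ablated (zero- or mean-ablation)
#   ablated_score: all relevant components ablated (baseline)

def faithfulness(full_score: float, circuit_score: float, ablated_score: float) -> float:
    """Fraction of the full model's performance that the circuit alone recovers."""
    denom = full_score - ablated_score
    return (circuit_score - ablated_score) / denom if denom != 0 else 0.0

# Hypothetical numbers; the resulting score depends on methodological choices
# (ablation type, which components are eligible, the task metric), which is the
# sensitivity that the faithfulness paper above points out.
print(round(faithfulness(full_score=0.92, circuit_score=0.85, ablated_score=0.10), 2))  # 0.91
```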