Related papers: Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach

URL: http://arxiv.org/abs/2407.13594v1
Date: Thu, 18 Jul 2024 15:32:44 GMT
Title: Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Authors: Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha
Abstract summary: Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. We give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis. We present evidence to support that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
Score: 28.336108192282737
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We use these axioms to guide the mechanistic interpretability analysis of a Transformer-based model trained to solve the well-known 2-SAT problem. We are able to reverse engineer the algorithm learned by the model -- the model first parses the input formulas and then evaluates their satisfiability via enumeration of different possible valuations of the Boolean input variables. We also present evidence to support that the mechanistic interpretation of the analyzed model indeed satisfies the stated axioms.
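As a rough illustration of the "enumeration of different possible valuations" strategy mentioned in the abstract, the following minimal sketch checks 2-SAT satisfiability by brute force. It is not the authors' model or code: the function name `is_satisfiable`, the signed-integer literal encoding, and the clause format are all assumptions made here for illustration.

```python
from itertools import product


def is_satisfiable(clauses, num_vars):
    """Brute-force 2-SAT check (illustrative sketch, not the paper's model).

    clauses:  list of (lit_a, lit_b) pairs; a literal k > 0 means x_k,
              and k < 0 means NOT x_|k| (an encoding assumed for this sketch).
    num_vars: number of Boolean variables x_1 .. x_num_vars.
    """
    # Enumerate every possible valuation of the Boolean variables.
    for valuation in product([False, True], repeat=num_vars):

        def holds(lit):
            # Look up the variable's value, negating it for negative literals.
            value = valuation[abs(lit) - 1]
            return value if lit > 0 else not value

        # The formula is satisfied if every clause has at least one true literal.
        if all(holds(a) or holds(b) for a, b in clauses):
            return True
    return False


if __name__ == "__main__":
    # (x1 OR x2) AND (NOT x1 OR x2)
    print(is_satisfiable([(1, 2), (-1, 2)], num_vars=2))
    # x1 AND NOT x1, written as two-literal clauses
    print(is_satisfiable([(1, 1), (-1, -1)], num_vars=1))
```

Enumeration takes time exponential in the number of variables, whereas 2-SAT also admits polynomial-time algorithms; the sketch only mirrors the parse-then-enumerate description given in the abstract.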

Related papers

Embryology of a Language Model [1.1874560263468232]
In this work, we introduce an embryological approach, applying UMAP to the susceptibility matrix to visualize the model's structural development over training. Our visualizations reveal the emergence of a clear "body plan", charting the formation of known features like the induction circuit and discovering previously unknown structures.
arXiv Detail & Related papers (2025-08-01T05:39:41Z)
A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i [0.0]
We argue that Mechanistic Interpretability research is a principled approach to understanding models. We show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined.
arXiv Detail & Related papers (2025-05-01T19:08:34Z)
Neural Interpretable Reasoning [12.106771300842945]
We formalize a novel modeling framework for achieving interpretability in deep learning. We show that this complexity can be mitigated by treating interpretability as a Markovian property. We propose a new modeling paradigm -- neural generation and interpretable execution.
arXiv Detail & Related papers (2025-02-17T10:33:24Z)
Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning [9.795934690403374]
It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. We employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process. We demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
arXiv Detail & Related papers (2025-02-13T07:19:05Z)
A Comprehensive Survey on Self-Interpretable Neural Networks [36.0575431131253]
Self-interpretable neural networks inherently reveal the prediction rationale through the model structures. We first collect and review existing works on self-interpretable neural networks and provide a structured summary of their methodologies. We also present concrete, visualized examples of model explanations and discuss their applicability across diverse scenarios.
arXiv Detail & Related papers (2025-01-26T18:50:16Z)
The Buffer Mechanism for Multi-Step Information Reasoning in Language Models [52.77133661679439]
Investigating the internal reasoning mechanisms of large language models can help us design better model architectures and training strategies. In this study, we constructed a symbolic dataset to investigate the mechanisms by which Transformer models employ a vertical thinking strategy. We proposed a random matrix-based algorithm to enhance the model's reasoning ability, resulting in a 75% reduction in the training time required for the GPT-2 model.
arXiv Detail & Related papers (2024-05-24T07:41:26Z)
Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers. We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models. Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
Axiomatic Causal Interventions for Reverse Engineering Relevance Computation in Neural Retrieval Models [20.29451537633895]
We propose the use of causal interventions to reverse engineer neural rankers. We demonstrate how mechanistic interpretability methods can be used to isolate components satisfying term-frequency axioms.
arXiv Detail & Related papers (2024-05-03T22:30:15Z)
An Encoding of Abstract Dialectical Frameworks into Higher-Order Logic [57.24311218570012]
This approach allows for the computer-assisted analysis of abstract dialectical frameworks. Exemplary applications include the formal analysis and verification of meta-theoretical properties.
arXiv Detail & Related papers (2023-12-08T09:32:26Z)
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars [36.780346257061495]
Interpretability methods aim to understand the algorithm implemented by a trained model. We take a critical view of methods that exclusively focus on individual parts of the model.
arXiv Detail & Related papers (2023-12-03T15:34:46Z)
Uncovering Intermediate Variables in Transformers using Circuit Probing [28.81226181942661]
We propose a new analysis technique - circuit probing - that automatically uncovers low-level circuits that compute hypothesized intermediate variables. We apply this method to models trained on simple arithmetic tasks, demonstrating its effectiveness at (1) deciphering the algorithms that models have learned, (2) revealing modular structure within a model, and (3) tracking the development of circuits over training.
arXiv Detail & Related papers (2023-11-07T21:27:17Z)
Discovering interpretable elastoplasticity models via the neural polynomial method enabled symbolic regressions [0.0]
Conventional neural network elastoplasticity models are often perceived as lacking interpretability. This paper introduces a two-step machine learning approach that returns mathematical models interpretable by human experts.
arXiv Detail & Related papers (2023-07-24T22:22:32Z)
Adversarial Attacks on the Interpretation of Neuron Activation Maximization [70.5472799454224]
Activation-maximization approaches are used to interpret and analyze trained deep-learning models. In this work, we consider the concept of an adversary manipulating a model for the purpose of deceiving the interpretation.
arXiv Detail & Related papers (2023-06-12T19:54:33Z)
Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability [30.76910454663951]
Causal abstraction provides a theoretical foundation for mechanistic interpretability. Our contributions are generalizing the theory of causal abstraction from mechanism replacement to arbitrary mechanism transformation.
arXiv Detail & Related papers (2023-01-11T20:42:41Z)
Local Interpretable Model Agnostic Shap Explanations for machine learning models [0.0]
We propose a methodology that we define as Local Interpretable Model Agnostic Shap Explanations (LIMASE). The proposed technique uses Shapley values under the LIME paradigm to achieve the following: (a) explain the prediction of any model by using a locally faithful and interpretable decision tree model, on which the Tree Explainer is used to calculate the Shapley values and give visually interpretable explanations.
arXiv Detail & Related papers (2022-10-10T10:07:27Z)
Hessian-based toolbox for reliable and interpretable machine learning in physics [58.720142291102135]
We present a toolbox for interpretability and reliability, agnostic of the model architecture. It provides a notion of the influence of the input data on the prediction at a given test point, an estimation of the uncertainty of the model predictions, and an extrapolation score for the model predictions. Our work opens the road to the systematic use of interpretability and reliability methods in ML applied to physics and, more generally, science.
arXiv Detail & Related papers (2021-08-04T16:32:59Z)
Interpretable Deep Learning: Interpretations, Interpretability, Trustworthiness, and Beyond [49.93153180169685]
We introduce and clarify two basic concepts, interpretations and interpretability, that people often confuse. We elaborate on the design of several recent interpretation algorithms, from different perspectives, by proposing a new taxonomy. We summarize the existing work on evaluating models' interpretability using "trustworthy" interpretation algorithms.
arXiv Detail & Related papers (2021-03-19T08:40:30Z)
A Chain Graph Interpretation of Real-World Neural Networks [58.78692706974121]
We propose an alternative interpretation that identifies NNs as chain graphs (CGs) and feed-forward as an approximate inference procedure. The CG interpretation specifies the nature of each NN component within the rich theoretical framework of probabilistic graphical models. We demonstrate with concrete examples that the CG interpretation can provide novel theoretical support and insights for various NN techniques.
arXiv Detail & Related papers (2020-06-30T14:46:08Z)
Obtaining Faithful Interpretations from Compositional Neural Networks [72.41100663462191]
We evaluate the intermediate outputs of NMNs on NLVR2 and DROP datasets. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour.
arXiv Detail & Related papers (2020-05-02T06:50:35Z)
Semantics-Aware Inferential Network for Natural Language Understanding [79.70497178043368]
We propose a Semantics-Aware Inferential Network (SAIN) to meet such a motivation. Taking explicit contextualized semantics as a complementary input, the inferential module of SAIN enables a series of reasoning steps over semantic clues. Our model achieves significant improvement on 11 tasks including machine reading comprehension and natural language inference.
arXiv Detail & Related papers (2020-04-28T07:24:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.