Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations
- URL: http://arxiv.org/abs/2410.07476v2
- Date: Fri, 11 Oct 2024 22:29:01 GMT
- Title: Unifying and Verifying Mechanistic Interpretations: A Case Study with Group Operations
- Authors: Wilson Wu, Louis Jaburi, Jacob Drori, Jason Gross
- Abstract summary: A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups.
We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure.
We produce a more complete description of such models that unifies the explanations of previous works.
- Score: 0.8305049591788082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A recent line of work in mechanistic interpretability has focused on reverse-engineering the computation performed by neural networks trained on the binary operation of finite groups. We investigate the internals of one-hidden-layer neural networks trained on this task, revealing previously unidentified structure and producing a more complete description of such models that unifies the explanations of previous works. Notably, these models approximate equivariance in each input argument. We verify that our explanation applies to a large fraction of networks trained on this task by translating it into a compact proof of model performance, a quantitative evaluation of model understanding. In particular, our explanation yields a guarantee of model accuracy that runs in 30% of the time of brute force and gives a ≥95% accuracy bound for 45% of the models we trained. We were unable to obtain non-vacuous accuracy bounds using only explanations from previous works.
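To make the setting concrete, the sketch below trains a one-hidden-layer MLP on the full multiplication table of a small finite group and then checks its accuracy by brute force over all input pairs, the baseline against which the paper's compact proofs are compared. It is a minimal illustration assuming the cyclic group Z_n under addition and a one-hot encoding of each argument; the hidden width, optimizer settings, and train fraction are placeholders, and the paper's actual architectures, group choices, and verification procedure may differ.

```python
# Minimal sketch (not the authors' code): a one-hidden-layer MLP trained on
# the binary operation of a finite group, here the cyclic group Z_n under
# addition mod n. Hidden width, learning rate, weight decay, and the train
# fraction are illustrative placeholders.
import torch
import torch.nn as nn

n = 59            # group order (assumption: Z_n)
hidden = 128      # hidden width (assumption)

# Enumerate all pairs (a, b) and their products (a + b) mod n.
a = torch.arange(n).repeat_interleave(n)
b = torch.arange(n).repeat(n)
labels = (a + b) % n

# Each input concatenates one-hot encodings of the two arguments.
x = torch.cat([nn.functional.one_hot(a, n),
               nn.functional.one_hot(b, n)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * n, hidden), nn.ReLU(), nn.Linear(hidden, n))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

# Train on a random subset of pairs, as is common in this line of work.
train_idx = torch.randperm(n * n)[: (n * n) // 2]
for step in range(5000):
    opt.zero_grad()
    loss = loss_fn(model(x[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()

# Brute-force accuracy check: evaluate the model on every pair in the group.
with torch.no_grad():
    accuracy = (model(x).argmax(dim=-1) == labels).float().mean().item()
print(f"accuracy over all {n * n} pairs: {accuracy:.3f}")
```

The paper's contribution is to replace the exhaustive check at the end with a bound derived from the mechanistic explanation, which the abstract reports runs in roughly 30% of the brute-force time.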
Related papers
- Improving Network Interpretability via Explanation Consistency Evaluation [56.14036428778861]
We propose a framework that acquires more explainable activation heatmaps and simultaneously increases model performance.
Specifically, our framework introduces a new metric, i.e., explanation consistency, to reweight the training samples adaptively in model learning.
Our framework then promotes the model learning by paying closer attention to those training samples with a high difference in explanations.
arXiv Detail & Related papers (2024-08-08T17:20:08Z)
- SINDER: Repairing the Singular Defects of DINOv2 [61.98878352956125]
Vision Transformer models trained on large-scale datasets often exhibit artifacts in the patch tokens they extract.
We propose a novel fine-tuning smooth regularization that rectifies structural deficiencies using only a small dataset.
arXiv Detail & Related papers (2024-07-23T20:34:23Z)
- Robust Nonparametric Hypothesis Testing to Understand Variability in Training Neural Networks [5.8490454659691355]
We propose a new measure of closeness between classification models based on the output of the network before thresholding.
Our measure is based on a robust hypothesis-testing framework and can be adapted to other quantities derived from trained models.
arXiv Detail & Related papers (2023-10-01T01:44:35Z)
- VCNet: A self-explaining model for realistic counterfactual generation [52.77024349608834]
Counterfactual explanation is a class of methods for producing local explanations of machine learning decisions.
We present VCNet (Variational Counter Net), a model architecture that combines a predictor and a counterfactual generator.
We show that VCNet is able both to generate predictions and to generate counterfactual explanations without having to solve another minimisation problem.
arXiv Detail & Related papers (2022-12-21T08:45:32Z)
- Reconciliation of Pre-trained Models and Prototypical Neural Networks in Few-shot Named Entity Recognition [35.34238362639678]
We propose a one-line-code normalization method to reconcile such a mismatch, on both empirical and theoretical grounds.
Our work also provides an analytical viewpoint for addressing the general problems in few-shot named entity recognition.
arXiv Detail & Related papers (2022-11-07T02:33:45Z)
- Harnessing the Power of Explanations for Incremental Training: A LIME-Based Approach [6.244905619201076]
In this work, model explanations are fed back to the feed-forward training to help the model generalize better.
The framework incorporates a custom weighted loss together with Elastic Weight Consolidation (EWC) to maintain performance on sequential test sets (a rough sketch of this combination appears after this list).
The proposed custom training procedure results in a consistent enhancement of accuracy ranging from 0.5% to 1.5% throughout all phases of the incremental learning setup.
arXiv Detail & Related papers (2022-11-02T18:16:17Z)
- SynBench: Task-Agnostic Benchmarking of Pretrained Representations using Synthetic Data [78.21197488065177]
Recent success in fine-tuning large models, pretrained on broad data at scale, on downstream tasks has led to a significant paradigm shift in deep learning.
This paper proposes a new task-agnostic framework, SynBench, to measure the quality of pretrained representations using synthetic data.
arXiv Detail & Related papers (2022-10-06T15:25:00Z)
- Invariant Causal Mechanisms through Distribution Matching [86.07327840293894]
In this work we provide a causal perspective and a new algorithm for learning invariant representations.
Empirically, we show that this algorithm works well on a diverse set of tasks and, in particular, we observe state-of-the-art performance on domain generalization.
arXiv Detail & Related papers (2022-06-23T12:06:54Z)
- Building Reliable Explanations of Unreliable Neural Networks: Locally Smoothing Perspective of Model Interpretation [0.0]
We present a novel method for reliably explaining the predictions of neural networks.
Our method is built on the assumption of a smooth landscape in the loss function of the model prediction.
arXiv Detail & Related papers (2021-03-26T08:52:11Z)
- Interpreting Graph Neural Networks for NLP With Differentiable Edge Masking [63.49779304362376]
Graph neural networks (GNNs) have become a popular approach to integrating structural inductive biases into NLP models.
We introduce a post-hoc method for interpreting the predictions of GNNs which identifies unnecessary edges.
We show that we can drop a large proportion of edges without deteriorating the performance of the model.
arXiv Detail & Related papers (2020-10-01T17:51:19Z)
- The Gaussian equivalence of generative models for learning with shallow neural networks [30.47878306277163]
We study the performance of neural networks trained on data drawn from pre-trained generative models.
We provide three strands of rigorous, analytical and numerical evidence corroborating this equivalence.
These results open a viable path to the theoretical study of machine learning models with realistic data.
arXiv Detail & Related papers (2020-06-25T21:20:09Z)
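As a rough illustration of the loss construction named in the LIME-based incremental-training entry above, the sketch below combines a per-sample weighted cross-entropy with a standard Elastic Weight Consolidation penalty. This is the textbook EWC formulation under assumed names (sample_weights, fisher, ewc_lambda), not that paper's implementation; how the per-sample weights would be derived from LIME explanations is not shown.

```python
# Hypothetical sketch: per-sample weighted cross-entropy plus a standard
# Elastic Weight Consolidation (EWC) penalty, as combined (in some form) in
# the LIME-based incremental-training entry above. All names are illustrative.
import torch
import torch.nn.functional as F

def ewc_weighted_loss(model, logits, targets, sample_weights,
                      old_params, fisher, ewc_lambda=1.0):
    # Task loss: cross-entropy reweighted per sample (e.g., by an
    # explanation-derived importance score).
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    task_loss = (sample_weights * per_sample).mean()

    # EWC penalty: anchor parameters to their values from the previous phase,
    # scaled by a (diagonal) Fisher-information estimate.
    penalty = sum((fisher[name] * (p - old_params[name]) ** 2).sum()
                  for name, p in model.named_parameters())
    return task_loss + ewc_lambda * penalty
```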