Localizing Model Behavior with Path Patching
- URL: http://arxiv.org/abs/2304.05969v2
- Date: Tue, 16 May 2023 16:24:55 GMT
- Title: Localizing Model Behavior with Path Patching
- Authors: Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, Aryaman Arora
- Abstract summary: We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses asserting that behaviors are localized to a set of paths.
We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.
- Score: 1.5293427903448025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Localizing behaviors of neural networks to a subset of the network's
components or a subset of interactions between components is a natural first
step towards analyzing network mechanisms and possible failure modes. Existing
work is often qualitative and ad-hoc, and there is no consensus on the
appropriate way to evaluate localization claims. We introduce path patching, a
technique for expressing and quantitatively testing a natural class of
hypotheses expressing that behaviors are localized to a set of paths. We refine
an explanation of induction heads, characterize a behavior of GPT-2, and open
source a framework for efficiently running similar experiments.
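A minimal, hypothetical sketch of the patching operation the abstract describes, using a toy residual model (all names such as `W_a` and `forward` are illustrative, not the authors' open-source framework): splice one component's activation from a "corrupted" run into an otherwise clean run, then measure how far the output moves. Full path patching additionally restricts which downstream routes the patched activation may flow through; this sketch shows only the basic splice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-block residual model: x -> two parallel components (head_a, head_b)
# writing into a residual stream -> output projection.
W_a, W_b, W_out = (rng.standard_normal((4, 4)) for _ in range(3))

def forward(x, patch_a=None):
    """Run the toy model; optionally overwrite head_a's output with a patch."""
    h_a = W_a @ x if patch_a is None else patch_a
    h_b = W_b @ x
    resid = x + h_a + h_b   # residual stream collects all component outputs
    return W_out @ resid    # output logits

x_clean = rng.standard_normal(4)
x_corrupt = rng.standard_normal(4)

# Patching: take head_a's activation from the corrupted input and splice it
# into a clean forward pass. A large output shift is evidence that the
# behavior under study flows through paths involving head_a.
h_a_corrupt = W_a @ x_corrupt
logits_clean = forward(x_clean)
logits_patched = forward(x_clean, patch_a=h_a_corrupt)
effect = np.linalg.norm(logits_patched - logits_clean)
```

Repeating this splice per component (or per path, by also controlling which downstream modules see the patched value) quantifies a localization claim rather than asserting it qualitatively.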
Related papers
- APEX: Probing Neural Networks via Activation Perturbation [10.517751599566548]
We introduce Activation Perturbation for EXploration (APEX) as an inference-time probing paradigm for neural networks.
APEX perturbs hidden activations while keeping both inputs and model parameters fixed.
Our results show that APEX offers an effective perspective for exploring and understanding neural networks beyond what is accessible from input space alone.
arXiv Detail & Related papers (2026-02-03T14:36:36Z) - Neural Network Reprogrammability: A Unified Theme on Model Reprogramming, Prompt Tuning, and Prompt Instruction [55.914891182214475]
We introduce neural network reprogrammability as a unifying framework for model adaptation.
We present a taxonomy that categorizes such information manipulation approaches across four key dimensions.
We also analyze remaining technical challenges and ethical considerations.
arXiv Detail & Related papers (2025-06-05T05:42:27Z) - Unifying Perplexing Behaviors in Modified BP Attributions through Alignment Perspective [61.5509267439999]
We present a unified theoretical framework for methods like GBP, RectGrad, LRP, and DTD.
We demonstrate that they achieve input alignment by combining the weights of activated neurons.
This alignment improves the visualization quality and reduces sensitivity to weight randomization.
arXiv Detail & Related papers (2025-03-14T07:58:26Z) - Can We Validate Counterfactual Estimations in the Presence of General Network Interference? [6.092214762701847]
We introduce a new framework enabling cross-validation for counterfactual estimation.
At its core is our distribution-preserving network bootstrap method.
We extend recent causal message-passing developments by incorporating heterogeneous unit-level characteristics.
arXiv Detail & Related papers (2025-02-03T06:51:04Z) - Identifying Sub-networks in Neural Networks via Functionally Similar Representations [41.028797971427124]
We take a step toward automating the understanding of the network by investigating the existence of distinct sub-networks.
Our approach offers meaningful insights into the behavior of neural networks with minimal human and computational cost.
arXiv Detail & Related papers (2024-10-21T20:19:00Z) - Relative Representations: Topological and Geometric Perspectives [53.88896255693922]
Relative representations are an established approach to zero-shot model stitching.
We introduce a normalization procedure in the relative transformation, resulting in invariance to non-isotropic rescalings and permutations.
Second, we propose to deploy topological densification when fine-tuning relative representations, a topological regularization loss encouraging clustering within classes.
arXiv Detail & Related papers (2024-09-17T08:09:22Z) - Provable Bounds on the Hessian of Neural Networks: Derivative-Preserving Reachability Analysis [6.9060054915724]
We propose a novel reachability analysis method tailored for neural networks with differentiable activations.
A key aspect of our method is loop transformation on the activation functions to exploit their monotonicity effectively.
The resulting end-to-end abstraction locally preserves the derivative information, yielding accurate bounds on small input sets.
arXiv Detail & Related papers (2024-06-06T20:02:49Z) - Towards Subject Agnostic Affective Emotion Recognition [8.142798657174332]
EEG signals manifest subject instability in subject-agnostic affective brain-computer interfaces (aBCIs).
We propose a novel framework, meta-learning based augmented domain adaptation for subject-agnostic aBCIs.
Our proposed approach is shown to be effective in experiments on a public aBCI dataset.
arXiv Detail & Related papers (2023-10-20T23:44:34Z) - Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation [73.83178465971552]
The success of automated medical image analysis depends on large-scale and expert-annotated training sets.
Unsupervised domain adaptation (UDA) has emerged as a promising approach to alleviate the burden of labeled data collection.
We propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective.
arXiv Detail & Related papers (2023-07-27T08:58:05Z) - Detection of Uncertainty in Exceedance of Threshold (DUET): An Adversarial Patch Localizer [8.513938423514636]
Development of defenses against physical world attacks such as adversarial patches is gaining traction within the research community.
We contribute to the field of adversarial patch detection by introducing an uncertainty-based adversarial patch localizer.
This algorithm provides a framework to ascertain confidence in the adversarial patch localization.
arXiv Detail & Related papers (2023-03-18T00:07:26Z) - A Computational Framework of Cortical Microcircuits Approximates Sign-concordant Random Backpropagation [7.601127912271984]
We propose a hypothetical framework consisting of a new microcircuit architecture and its supporting Hebbian learning rules.
We employ the Hebbian rule operating in local compartments to update synaptic weights and achieve supervised learning in a biologically plausible manner.
The proposed framework is benchmarked on several datasets including MNIST and CIFAR10, demonstrating promising BP-comparable accuracy.
arXiv Detail & Related papers (2022-05-15T14:22:03Z) - Triggering Failures: Out-Of-Distribution detection by learning from local adversarial attacks in Semantic Segmentation [76.2621758731288]
We tackle the detection of out-of-distribution (OOD) objects in semantic segmentation.
Our main contribution is a new OOD detection architecture called ObsNet associated with a dedicated training scheme based on Local Adversarial Attacks (LAA).
We show it achieves top performance in both speed and accuracy compared to ten recent methods from the literature on three different datasets.
arXiv Detail & Related papers (2021-08-03T17:09:56Z) - Revisiting Indirect Ontology Alignment: New Challenging Issues in Cross-Lingual Context [0.0]
This article introduces a new method of indirect alignment of ontologies in a cross-lingual context.
The proposed method is based on alignment algebra which governs the composition of relationships and confidence values.
The obtained results are very encouraging and highlight many positive aspects of the proposed method.
arXiv Detail & Related papers (2021-04-04T15:21:09Z) - Cross-Domain Similarity Learning for Face Recognition in Unseen Domains [90.35908506994365]
We introduce a novel cross-domain metric learning loss, which we dub Cross-Domain Triplet (CDT) loss, to improve face recognition in unseen domains.
The CDT loss encourages learning semantically meaningful features by enforcing compact feature clusters of identities from one domain.
Our method does not require careful hard-pair sample mining and filtering strategy during training.
arXiv Detail & Related papers (2021-03-12T19:48:01Z) - Visualization of Supervised and Self-Supervised Neural Networks via Attribution Guided Factorization [87.96102461221415]
We develop an algorithm that provides per-class explainability.
In an extensive battery of experiments, we demonstrate the ability of our methods to produce class-specific visualizations.
arXiv Detail & Related papers (2020-12-03T18:48:39Z) - Developing Constrained Neural Units Over Time [81.19349325749037]
This paper focuses on an alternative way of defining neural networks that differs from the majority of existing approaches.
The structure of the neural architecture is defined by means of a special class of constraints that are extended also to the interaction with data.
The proposed theory is cast into the time domain, in which data are presented to the network in an ordered manner.
arXiv Detail & Related papers (2020-09-01T09:07:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.