Towards falsifiable interpretability research
- URL: http://arxiv.org/abs/2010.12016v1
- Date: Thu, 22 Oct 2020 22:03:41 GMT
- Title: Towards falsifiable interpretability research
- Authors: Matthew L. Leavitt, Ari Morcos
- Abstract summary: We argue that interpretability research suffers from an over-reliance on intuition-based approaches.
We examine two popular classes of interpretability methods: saliency and single-neuron-based approaches.
To address these impediments, we propose a framework for strongly falsifiable interpretability research.
- Score: 7.360807642941714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Methods for understanding the decisions of and mechanisms underlying deep
neural networks (DNNs) typically rely on building intuition by emphasizing
sensory or semantic features of individual examples. For instance, methods aim
to visualize the components of an input which are "important" to a network's
decision, or to measure the semantic properties of single neurons. Here, we
argue that interpretability research suffers from an over-reliance on
intuition-based approaches that risk, and in some cases have caused, illusory
progress and misleading conclusions. We identify a set of limitations that we
argue impede meaningful progress in interpretability research, and examine two
popular classes of interpretability methods, saliency and single-neuron-based
approaches, that serve as case studies for how overreliance on intuition and
lack of falsifiability can undermine interpretability research. To address
these impediments, we propose a framework for strongly falsifiable
interpretability research. We encourage
researchers to use their intuitions as a starting point to develop and test
clear, falsifiable hypotheses, and hope that our framework yields robust,
evidence-based interpretability methods that generate meaningful advances in
our understanding of DNNs.
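For concreteness, here is a minimal sketch of the kind of saliency method the abstract critiques: a vanilla-gradient saliency map. This is an illustration only, assuming a PyTorch image classifier; the model and input below are hypothetical stand-ins, not the authors' code.

```python
# Vanilla-gradient saliency: how much does each input pixel affect the
# predicted class score? Model and input are hypothetical stand-ins.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # untrained stand-in; use a trained model in practice
x = torch.randn(1, 3, 224, 224, requires_grad=True)  # stand-in image

logits = model(x)
score = logits[0, logits.argmax()]  # score of the predicted class
score.backward()  # gradient of that score w.r.t. the input

# Saliency map: per-pixel gradient magnitude, max over color channels.
saliency = x.grad.detach().abs().max(dim=1).values  # shape (1, 224, 224)
```

The paper's argument is that a heatmap like this invites intuitive conclusions about what the network "uses" that are rarely stated as testable hypotheses; the proposed framework asks for falsifiable claims and quantitative tests instead.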
Related papers
- Missed Causes and Ambiguous Effects: Counterfactuals Pose Challenges for Interpreting Neural Networks [14.407025310553225]
Interpretability research takes counterfactual theories of causality for granted.
Counterfactual theories have problems that bias our findings in specific and predictable ways.
We discuss the implications of these challenges for interpretability researchers.
arXiv Detail & Related papers (2024-07-05T17:53:03Z)
- A Survey on Transferability of Adversarial Examples across Deep Neural Networks [53.04734042366312]
Adversarial examples can manipulate machine learning models into making erroneous predictions.
The transferability of adversarial examples enables black-box attacks that circumvent the need for detailed knowledge of the target model, as the sketch after this entry illustrates.
This survey explores the landscape of the transferability of adversarial examples.
arXiv Detail & Related papers (2023-10-26T17:45:26Z)
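To make the transferability setting concrete, here is a minimal sketch that crafts a one-step FGSM adversarial example on a surrogate model and checks whether it also fools a separately trained target. The models, input, label, and step size are hypothetical stand-ins, assuming PyTorch; this is not the survey's code.

```python
# Transferability sketch: perturb on a surrogate, attack a black-box target.
import torch
import torch.nn.functional as F
import torchvision.models as models

surrogate = models.resnet18(weights=None).eval()  # attacker's white-box model
target = models.vgg11(weights=None).eval()        # black-box victim

x = torch.randn(1, 3, 224, 224)  # stand-in image
y = torch.tensor([7])            # stand-in label

# FGSM on the surrogate: one signed-gradient step of size epsilon.
x_adv = x.clone().requires_grad_(True)
F.cross_entropy(surrogate(x_adv), y).backward()
x_adv = (x_adv + 0.03 * x_adv.grad.sign()).detach()

# Transferability: does the same perturbation fool the unseen target?
transfers = (target(x_adv).argmax(dim=1) != y).item()
```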
- Adversarial Attacks on the Interpretation of Neuron Activation Maximization [70.5472799454224]
Activation-maximization approaches are used to interpret and analyze trained deep-learning models.
In this work, we consider the concept of an adversary manipulating a model for the purpose of deceiving the interpretation (the activation-maximization procedure itself is sketched after this entry).
arXiv Detail & Related papers (2023-06-12T19:54:33Z)
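For reference, a minimal sketch of the activation-maximization procedure that such attacks target: gradient ascent on the input to maximize one unit's activation. The model, layer, and unit index are hypothetical stand-ins, assuming PyTorch.

```python
# Activation maximization: synthesize an input that drives one unit.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in model
unit = 42  # hypothetical channel index in layer4

acts = {}
model.layer4.register_forward_hook(lambda m, i, o: acts.update(out=o))

x = (0.1 * torch.randn(1, 3, 224, 224)).requires_grad_(True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    model(x)
    loss = -acts["out"][0, unit].mean()  # negate to maximize activation
    loss.backward()
    opt.step()
# x now approximates the unit's "preferred stimulus", the artifact an
# adversary can manipulate to mislead interpretation.
```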
- Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts [24.390922632057627]
Neuro-Symbolic (NeSy) predictive models hold the promise of improved compliance with given constraints.
They make it possible to infer labels that are consistent with prior knowledge by reasoning over high-level concepts extracted from sub-symbolic inputs.
It was recently shown that NeSy predictors are affected by reasoning shortcuts: they can attain high accuracy by leveraging concepts with unintended semantics, thus falling short of their promised advantages.
arXiv Detail & Related papers (2023-05-31T15:35:48Z)
- Interpreting Neural Policies with Disentangled Tree Representations [58.769048492254555]
We study the interpretability of compact neural policies through the lens of disentangled representation.
We leverage decision trees to obtain factors of variation for disentanglement in robot learning.
We introduce interpretability metrics that measure disentanglement of learned neural dynamics.
arXiv Detail & Related papers (2022-10-13T01:10:41Z)
- A Latent-Variable Model for Intrinsic Probing [93.62808331764072]
We propose a novel latent-variable formulation for constructing intrinsic probes.
We find empirical evidence that pre-trained representations develop a cross-lingually entangled notion of morphosyntax.
arXiv Detail & Related papers (2022-01-20T15:01:12Z)
- Robust Explainability: A Tutorial on Gradient-Based Attribution Methods for Deep Neural Networks [1.5854438418597576]
We present gradient-based interpretability methods for explaining the decisions of deep neural networks.
We discuss the role that adversarial robustness plays in having meaningful explanations.
We conclude with future directions for research at the convergence of robustness and explainability.
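As a concrete instance of the gradient-based attribution methods such a tutorial covers, here is a minimal sketch of integrated gradients. The model, input, baseline, and target class are hypothetical stand-ins, assuming PyTorch.

```python
# Integrated gradients: average gradients along a straight-line path
# from a baseline to the input, then scale by the input difference.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # stand-in model
x = torch.randn(1, 3, 224, 224)  # stand-in input
baseline = torch.zeros_like(x)   # black-image baseline (an assumption)
steps, cls = 50, 7               # interpolation steps, stand-in class

grads = []
for alpha in torch.linspace(0, 1, steps):
    xi = (baseline + alpha * (x - baseline)).requires_grad_(True)
    model(xi)[0, cls].backward()
    grads.append(xi.grad)

attribution = (x - baseline) * torch.stack(grads).mean(dim=0)
```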
arXiv Detail & Related papers (2021-07-23T18:06:29Z)
- ACRE: Abstract Causal REasoning Beyond Covariation [90.99059920286484]
We introduce the Abstract Causal REasoning dataset for systematic evaluation of current vision systems in causal induction.
Motivated by the stream of research on causal discovery in Blicket experiments, we query a visual reasoning system with four types of questions in either an independent scenario or an interventional scenario.
We notice that pure neural models tend towards an associative strategy, performing at chance level, whereas neuro-symbolic combinations struggle in backward-blocking reasoning.
arXiv Detail & Related papers (2021-03-26T02:42:38Z)
- Interpretable Deep Learning: Interpretations, Interpretability, Trustworthiness, and Beyond [49.93153180169685]
We introduce and clarify two basic concepts, interpretations and interpretability, that are often confused.
We elaborate on the design of several recent interpretation algorithms from different perspectives by proposing a new taxonomy.
We summarize the existing work in evaluating models' interpretability using "trustworthy" interpretation algorithms.
arXiv Detail & Related papers (2021-03-19T08:40:30Z)
- i-Algebra: Towards Interactive Interpretability of Deep Neural Networks [41.13047686374529]
We present i-Algebra, a first-of-its-kind interactive framework for interpreting deep neural networks (DNNs).
At its core is a library of atomic, composable operators, which explain model behaviors at varying input granularity, during different inference stages, and from distinct interpretation perspectives.
We conduct user studies in a set of representative analysis tasks, including inspecting adversarial inputs, resolving model inconsistency, and cleansing contaminated data, all demonstrating its promising usability.
arXiv Detail & Related papers (2021-01-22T19:22:57Z)
- Adversarial Examples on Object Recognition: A Comprehensive Survey [1.976652238476722]
Deep neural networks are at the forefront of machine learning research.
Adversarial examples are intentionally designed to test the network's sensitivity to distribution drifts.
We discuss the impact of adversarial examples on security, safety, and robustness of neural networks.
arXiv Detail & Related papers (2020-08-07T08:51:21Z)