The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms
- URL: http://arxiv.org/abs/2408.05859v1
- Date: Sun, 11 Aug 2024 20:50:16 GMT
- Title: The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms
- Authors: Adam Davies, Ashkan Khakzar
- Abstract summary: Mechanistic interpretability (MI) has emerged as a distinct research area studying the features and implicit algorithms learned by foundation models such as large language models.
We argue that current methods are ripe to facilitate a transition in deep learning interpretation echoing the "cognitive revolution" in 20th-century psychology.
We propose a taxonomy mirroring key parallels in computational neuroscience to describe two broad categories of MI research.
- Score: 3.3653074379567096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Artificial neural networks have long been understood as "black boxes": though we know their computation graphs and learned parameters, the knowledge encoded by these weights and functions they perform are not inherently interpretable. As such, from the early days of deep learning, there have been efforts to explain these models' behavior and understand them internally; and recently, mechanistic interpretability (MI) has emerged as a distinct research area studying the features and implicit algorithms learned by foundation models such as large language models. In this work, we aim to ground MI in the context of cognitive science, which has long struggled with analogous questions in studying and explaining the behavior of "black box" intelligent systems like the human brain. We leverage several important ideas and developments in the history of cognitive science to disentangle divergent objectives in MI and indicate a clear path forward. First, we argue that current methods are ripe to facilitate a transition in deep learning interpretation echoing the "cognitive revolution" in 20th-century psychology that shifted the study of human psychology from pure behaviorism toward mental representations and processing. Second, we propose a taxonomy mirroring key parallels in computational neuroscience to describe two broad categories of MI research, semantic interpretation (what latent representations are learned and used) and algorithmic interpretation (what operations are performed over representations) to elucidate their divergent goals and objects of study. Finally, we elaborate the parallels and distinctions between various approaches in both categories, analyze the respective strengths and weaknesses of representative works, clarify underlying assumptions, outline key challenges, and discuss the possibility of unifying these modes of interpretation under a common framework.
Related papers
- Neuro-Symbolic AI: Explainability, Challenges, and Future Trends [26.656105779121308]
This article proposes a classification for explainability by considering both model design and behavior of 191 studies from 2013.
We classify them into five categories by considering whether the form of bridging the representation differences is readable.
We put forward suggestions for future research in three aspects: unified representations, enhancing model explainability, and ethical considerations and social impact.
arXiv Detail & Related papers (2024-11-07T02:54:35Z)
- Hierarchical Invariance for Robust and Interpretable Vision Tasks at Larger Scales [54.78115855552886]
We show how to construct over-complete invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture.
With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner.
For robust and interpretable vision tasks at larger scales, hierarchical invariant representation can be considered as an effective alternative to traditional CNN and invariants.
arXiv Detail & Related papers (2024-02-23T16:50:07Z)
- Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- Machine Psychology [54.287802134327485]
We argue that a fruitful direction for research is engaging large language models in behavioral experiments inspired by psychology.
We highlight theoretical perspectives, experimental paradigms, and computational analysis techniques that this approach brings to the table.
It paves the way for a "machine psychology" for generative artificial intelligence (AI) that goes beyond performance benchmarks.
arXiv Detail & Related papers (2023-03-24T13:24:41Z)
- Rejecting Cognitivism: Computational Phenomenology for Deep Learning [5.070542698701158]
We propose a non-representationalist framework for deep learning relying on a novel method: computational phenomenology.
We reject the modern cognitivist interpretation of deep learning, according to which artificial neural networks encode representations of external entities.
arXiv Detail & Related papers (2023-02-16T20:05:06Z)
- Mapping Knowledge Representations to Concepts: A Review and New Perspectives [0.6875312133832078]
This review focuses on research that aims to associate internal representations with human understandable concepts.
We find this taxonomy and theories of causality useful for understanding what can be expected, and not expected, from neural network explanations.
The analysis additionally uncovers an ambiguity in the reviewed literature related to the goal of model explainability.
arXiv Detail & Related papers (2022-12-31T12:56:12Z)
- Interpreting Neural Policies with Disentangled Tree Representations [58.769048492254555]
We study interpretability of compact neural policies through the lens of disentangled representation.
We leverage decision trees to obtain factors of variation for disentanglement in robot learning.
We introduce interpretability metrics that measure disentanglement of learned neural dynamics.
arXiv Detail & Related papers (2022-10-13T01:10:41Z)
- Local Interpretations for Explainable Natural Language Processing: A Survey [5.717407321642629]
This work investigates various methods to improve the interpretability of deep neural networks for Natural Language Processing (NLP) tasks.
We provide a comprehensive discussion on the definition of the term interpretability and its various aspects at the beginning of this work.
arXiv Detail & Related papers (2021-03-20T02:28:33Z)
- Interpretable Deep Learning: Interpretations, Interpretability, Trustworthiness, and Beyond [49.93153180169685]
We introduce and clarify two basic concepts, interpretations and interpretability, that people often confuse.
We elaborate the design of several recent interpretation algorithms, from different perspectives, through proposing a new taxonomy.
We summarize the existing work in evaluating models' interpretability using "trustworthy" interpretation algorithms.
arXiv Detail & Related papers (2021-03-19T08:40:30Z)
- Neuro-symbolic Architectures for Context Understanding [59.899606495602406]
We propose the use of hybrid AI methodology as a framework for combining the strengths of data-driven and knowledge-driven approaches.
Specifically, we inherit the concept of neuro-symbolism as a way of using knowledge-bases to guide the learning progress of deep neural networks.
arXiv Detail & Related papers (2020-03-09T15:04:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.