FIND: A Function Description Benchmark for Evaluating Interpretability Methods
- URL: http://arxiv.org/abs/2309.03886v3
- Date: Fri, 8 Dec 2023 05:18:40 GMT
- Title: FIND: A Function Description Benchmark for Evaluating Interpretability Methods
- Authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
- Abstract summary: This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods.
FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate.
We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
- Score: 86.80718559904854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeling neural network submodules with human-legible descriptions is useful
for many downstream tasks: such descriptions can surface failures, guide
interventions, and perhaps even explain important model behaviors. To date,
most mechanistic descriptions of trained networks have involved small models,
narrowly delimited phenomena, and large amounts of human labor. Labeling all
human-interpretable sub-computations in models of increasing size and
complexity will almost certainly require tools that can generate and validate
descriptions automatically. Recently, techniques that use learned models
in-the-loop for labeling have begun to gain traction, but methods for
evaluating their efficacy are limited and ad-hoc. How should we validate and
compare open-ended labeling tools? This paper introduces FIND (Function
INterpretation and Description), a benchmark suite for evaluating the building
blocks of automated interpretability methods. FIND contains functions that
resemble components of trained neural networks, and accompanying descriptions
of the kind we seek to generate. The functions span textual and numeric
domains, and involve a range of real-world complexities. We evaluate methods
that use pretrained language models (LMs) to produce descriptions of function
behavior in natural language and code. Additionally, we introduce a new
interactive method in which an Automated Interpretability Agent (AIA) generates
function descriptions. We find that an AIA, built from an LM with black-box
access to functions, can infer function structure, acting as a scientist by
forming hypotheses, proposing experiments, and updating descriptions in light
of new data. However, AIA descriptions tend to capture global function behavior
and miss local details. These results suggest that FIND will be useful for
evaluating more sophisticated interpretability methods before they are applied
to real-world models.
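To make the interactive setup concrete, the sketch below illustrates the kind of evaluation FIND enables: a hidden numeric function stands in for a network component, and a simple probe-and-summarize loop mimics how an AIA might form hypotheses, run experiments, and update its description from the results. The hidden function, the probing heuristic, and the summarizer are illustrative stand-ins under assumed behavior, not the benchmark's actual functions or code.

import numpy as np


def hidden_function(x: float) -> float:
    """A numeric function of the kind such a benchmark might contain: linear
    overall, with a localized corruption that a global description can miss."""
    y = 2.0 * x + 1.0
    if 3.0 < x < 5.0:
        y += 10.0
    return y


def propose_inputs(round_idx: int, history):
    """Stand-in for hypothesis-driven probing: start with a coarse sweep,
    then zoom in where the outputs deviate most from a global linear fit."""
    if round_idx == 0:
        return list(np.linspace(-10.0, 10.0, 21))
    xs = np.array([x for x, _ in history])
    ys = np.array([y for _, y in history])
    slope, intercept = np.polyfit(xs, ys, 1)
    center = xs[int(np.argmax(np.abs(ys - (slope * xs + intercept))))]
    return list(np.linspace(center - 2.0, center + 2.0, 9))


def summarize(history) -> str:
    """Stand-in for the natural-language description an LM would produce.
    Like the AIA descriptions discussed above, it captures the global trend
    but omits the local corruption."""
    xs = np.array([x for x, _ in history])
    ys = np.array([y for _, y in history])
    slope, intercept = np.polyfit(xs, ys, 1)
    return f"approximately y = {slope:.1f} * x + {intercept:.1f}"


history = []
for round_idx in range(3):  # hypothesize -> experiment -> update
    for x in propose_inputs(round_idx, history):
        history.append((x, hidden_function(x)))
print(summarize(history))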
Related papers
- Adaptive Language-Guided Abstraction from Contrastive Explanations [53.48583372522492]
It is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward.
End-to-end methods for joint feature and reward learning often yield brittle reward functions that are sensitive to spurious state features.
This paper describes a method named ALGAE, which alternates between using language models to iteratively identify human-meaningful features and learning how those features should be used to compute reward.
arXiv Detail & Related papers (2024-09-12T16:51:58Z)
- Toward a Method to Generate Capability Ontologies from Natural Language Descriptions [43.06143768014157]
This contribution presents an innovative method to automate capability ontology modeling using Large Language Models (LLMs).
Our approach requires only a natural language description of a capability, which is then automatically inserted into a predefined prompt.
Our method greatly reduces manual effort, as only the initial natural language description and a final human review and possible correction are necessary.
arXiv Detail & Related papers (2024-06-12T07:41:44Z)
- A Multimodal Automated Interpretability Agent [63.8551718480664]
MAIA is a system that uses neural models to automate neural model understanding tasks.
We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images.
We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
arXiv Detail & Related papers (2024-04-22T17:55:11Z)
- Actuarial Applications of Natural Language Processing Using Transformers: Case Studies for Using Text Features in an Actuarial Context [0.0]
This tutorial demonstrates how to incorporate text data into actuarial classification and regression tasks.
The main focus is on methods employing transformer-based models.
The case studies tackle challenges related to a multi-lingual setting and long input sequences.
arXiv Detail & Related papers (2022-06-04T15:39:30Z)
- MACE: An Efficient Model-Agnostic Framework for Counterfactual Explanation [132.77005365032468]
We propose a novel framework for Model-Agnostic Counterfactual Explanation (MACE).
In our MACE approach, we propose a novel RL-based method for finding good counterfactual examples and a gradient-less descent method for improving proximity.
Experiments on public datasets validate the effectiveness of the approach, showing better validity, sparsity, and proximity.
arXiv Detail & Related papers (2022-05-31T04:57:06Z)
- A Diagnostic Study of Explainability Techniques for Text Classification [52.879658637466605]
We develop a list of diagnostic properties for evaluating existing explainability techniques.
We compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions, relating a model's performance to how well its rationales agree with human ones (a minimal sketch of such an agreement comparison appears after this list).
arXiv Detail & Related papers (2020-09-25T12:01:53Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
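As referenced in the diagnostic-study entry above, one way to quantify agreement between a technique's saliency scores and human rationales is to treat the human annotations as ground truth and the saliency scores as a ranking. The sketch below uses average precision for that scoring; the metric choice and the toy tokens, saliency values, and annotations are illustrative assumptions, not data or code from that paper.

from sklearn.metrics import average_precision_score

# One sentence, one saliency score per token (e.g. from gradients or LIME),
# and a human rationale marking which tokens annotators found salient.
tokens = ["the", "movie", "was", "painfully", "dull"]
saliency = [0.05, 0.20, 0.10, 0.85, 0.90]   # technique's token importances
human_rationale = [0, 0, 0, 1, 1]           # 1 = token marked salient

# Average precision treats the human mask as ground truth and the saliency
# scores as a ranking; values near 1.0 mean close agreement.
agreement = average_precision_score(human_rationale, saliency)
print(f"saliency/human agreement (AP): {agreement:.2f}")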