Fugu-MT Paper Translation (Abstract): A Function Interpretation Benchmark for Evaluating Interpretability Methods

Paper overview: A Function Interpretation Benchmark for Evaluating Interpretability Methods

arxiv url: http://arxiv.org/abs/2309.03886v1
Date: Thu, 7 Sep 2023 17:47:26 GMT
Status: Translation complete
Last updated in system: 2023-09-08 11:58:19.707600
Title: A Function Interpretation Benchmark for Evaluating Interpretability Methods
Title (reference translation): A Function Interpretation Benchmark for Interpretability Evaluation
Authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
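Abstract summary: This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability. FIND contains functions that resemble components of trained neural networks, along with descriptions of the kind we seek to generate. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior.
Reference score (independently computed attention score): 86.80718559904854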
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains, and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. We find that an off-the-shelf LM augmented with only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results show that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.
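Abstract (reference translation): Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling the human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that generate and validate descriptions automatically. Recently, labeling techniques that use learned models in the loop have begun to attract attention, but methods for evaluating their efficacy are limited. How should open-ended labeling tools be validated and compared? This paper proposes FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, along with descriptions of the kind we seek to generate. The functions are procedurally constructed across textual and numeric domains and involve a range of real-world complexities, including noise, composition, approximation, and bias. We evaluate new and existing methods that use language models (LMs) to produce code-based and language descriptions of function behavior. An off-the-shelf LM given only black-box access to functions can sometimes infer their structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, LM-based descriptions tend to capture global function behavior and miss local corruptions. These results suggest that FIND will be useful for characterizing the performance of more sophisticated interpretability methods before they are applied to real-world models.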
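
The abstract's point about global behavior versus local corruptions can be illustrated with a small sketch. The code below is not taken from the FIND benchmark: the function, the corrupted interval, the probe points, and every name in it are hypothetical, chosen only to show how a coarse black-box probe can recover a function's global rule while missing a corrupted region that a denser probe exposes.

```python
from typing import Callable, List, Tuple


def make_corrupted_function(lo: float, hi: float) -> Callable[[float], float]:
    """Hypothetical numeric function: linear globally, corrupted on [lo, hi]."""
    def f(x: float) -> float:
        if lo <= x <= hi:
            return 0.0  # local corruption (illustrative choice)
        return 2.0 * x + 1.0  # global behavior
    return f


def probe(f: Callable[[float], float], inputs: List[float]) -> List[Tuple[float, float]]:
    """Black-box access: we only observe input/output pairs."""
    return [(x, f(x)) for x in inputs]


def fit_line(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


f = make_corrupted_function(3.0, 5.0)

# A coarse probe happens to skip the corrupted interval entirely,
# so the fitted "description" reflects only the global rule.
a, b = fit_line(probe(f, [-10.0, -5.0, 0.0, 10.0, 20.0]))

# A denser probe, checked against that description, exposes the corruption.
dense = probe(f, [float(i) for i in range(-10, 21)])
corrupted = [x for x, y in dense if abs(y - (a * x + b)) > 1e-9]

print(f"global description: y = {a} * x + {b}")
print(f"inputs that contradict it: {corrupted}")
```

In this toy setting the description fitted from the coarse probe is accurate almost everywhere, which is why a description that only summarizes global behavior can look correct while still missing a local corruption.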

Related papers