Fugu-MT Paper Translation (Abstract): FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Paper summary: FIND: A Function Description Benchmark for Evaluating Interpretability Methods

arxiv url: http://arxiv.org/abs/2309.03886v3
Date: Fri, 8 Dec 2023 05:18:40 GMT
Status: Translation complete
Last updated in system: 2023-12-11 18:27:23.133323
Title: FIND: A Function Description Benchmark for Evaluating Interpretability Methods
Title (reference translation): FIND: A Function Description Benchmark for Interpretability Evaluation
Authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba
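Abstract summary: This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating automated interpretability methods. FIND contains functions that resemble components of trained neural networks, along with descriptions of the kind we seek to generate. We evaluate methods that use pretrained language models to produce descriptions of function behavior in natural language and code.
Reference score (independently computed attention metric): 86.80718559904854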
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions span textual and numeric domains, and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to produce descriptions of function behavior in natural language and code. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built from an LM with black-box access to functions, can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models.
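Abstract (reference translation): Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling the human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that generate and validate descriptions automatically. Recently, labeling techniques that use learned models in the loop have begun to attract attention, but methods for evaluating their efficacy are limited. How should open-ended labeling tools be validated and compared? This paper proposes FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, along with descriptions of the kind we seek to generate. The functions span textual and numeric domains and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to describe function behavior in natural language and code. In addition, we propose an interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. An AIA built from an LM with black-box access to functions can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models.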

Related papers