Fugu-MT Paper Translation (Abstract): Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

Paper overview: Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations

arxiv url: http://arxiv.org/abs/2606.24716v1
Date: Tue, 23 Jun 2026 15:39:29 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:49.041439
Title: Evaluating the Interpretability of Sparse Autoencoders with Concept Annotations
Title (reference translation): Evaluating the Interpretability of Sparse Autoencoders Using Concept Annotations
Authors: Jonas Klotz, Cassio F. Dantas, Pallavi Jain, Diego Marcos, Begüm Demir
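Abstract summary: This work proposes a human-grounded evaluation framework that quantifies the alignment between SAE latents and human-annotated concepts, and validates the matching through targeted attribute perturbations. The framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs.
Reference score (independently computed attention score): 12.535445487099393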
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sparse autoencoders (SAEs) are increasingly used to extract interpretable concepts from vision and vision language models, yet existing evaluation methods largely rely on proxy metrics or qualitative inspection rather than measuring semantic correspondence. We present a human-grounded evaluation framework that quantifies alignment between SAE latents and human-annotated concepts, without requiring user studies, and validate this matching through targeted attribute perturbations. To enable this intervention-style evaluation in vision, we construct synCUB and synCOCO, synthetic benchmarks of paired images that differ in exactly one attribute. We introduce Fully-Binary Matching Pursuit (FBMP), a coalition-based matching procedure that supports many-to-one mappings between SAE latents and annotated concepts, and consistently outperforms one-to-one baselines. For functional validation, we propose a Targeted Attribute Perturbation Alignment Score (TAPAScore), which tests whether matched concepts respond selectively and in the expected direction under targeted image-level attribute perturbations. Under sanity checks, our matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, we find that increased overcompleteness can reduce perturbation alignment, indicating a reduction in interpretability. Our evaluation framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs. Code and datasets are available at https://github.com/JonasKlotz/sae-concept-eval.
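Abstract (reference translation): Sparse autoencoders (SAEs) are increasingly used to extract interpretable concepts from vision and vision-language models, but existing evaluation methods rely largely on proxy metrics or qualitative inspection rather than measuring semantic correspondence. We propose a human-grounded evaluation framework that quantifies the alignment between SAE latents and human-annotated concepts without requiring user studies, and validates this matching through targeted attribute perturbations. To enable this intervention-style evaluation in vision, we construct synCUB and synCOCO, synthetic benchmarks of paired images that differ in exactly one attribute. We introduce Fully-Binary Matching Pursuit (FBMP), a coalition-based matching procedure that supports many-to-one mappings between SAE latents and annotated concepts and consistently outperforms one-to-one baselines. For functional validation, we propose the Targeted Attribute Perturbation Alignment Score (TAPAScore), which tests whether matched concepts respond selectively and in the expected direction under targeted image-level attribute perturbations. Under sanity checks, our matching and TAPAScore are the only evaluated metrics that reliably distinguish trained SAEs from untrained ones. Across SAEs trained on CLIP and DINOv2 embeddings, increased overcompleteness can reduce perturbation alignment, indicating a reduction in interpretability. Our evaluation framework suggests that moderate dictionary sizes provide the best trade-off, yielding the most interpretable SAEs. Code and datasets are available at https://github.com/JonasKlotz/sae-concept-eval.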
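
The abstract states that a coalition-based, many-to-one matching outperforms one-to-one baselines. The toy sketch below illustrates why that can happen. It is not the paper's FBMP: the greedy OR-coalition, the F1 criterion, the function names (`f1`, `greedy_coalition`), and the toy data are all assumptions made for illustration.

```python
def f1(pred, target):
    """F1 score between two equal-length binary sequences."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_coalition(latents, concept):
    """Greedily pick latents whose logical OR best matches a binary concept.

    Hypothetical stand-in for a many-to-one matcher (NOT the paper's FBMP).
    latents: dict mapping latent id -> list of 0/1 activations per image
    concept: list of 0/1 human annotations per image
    Returns (selected latent ids, F1 of their OR against the concept).
    """
    selected, union, best = [], [0] * len(concept), 0.0
    while True:
        pick, pick_union, pick_score = None, None, best
        for name, acts in latents.items():
            if name in selected:
                continue
            merged = [u or a for u, a in zip(union, acts)]
            score = f1(merged, concept)
            if score > pick_score:  # keep only strict improvements
                pick, pick_union, pick_score = name, merged, score
        if pick is None:  # no remaining latent improves the match
            return selected, best
        selected.append(pick)
        union, best = pick_union, pick_score


# Toy data (assumed): z0 and z1 each fire on half of the concept's images,
# while z2 fires only on images without the concept.
latents = {
    "z0": [1, 1, 0, 0, 0, 0],
    "z1": [0, 0, 1, 1, 0, 0],
    "z2": [0, 0, 0, 0, 1, 1],
}
concept = [1, 1, 1, 1, 0, 0]

coalition, score = greedy_coalition(latents, concept)
best_single = max(f1(acts, concept) for acts in latents.values())
print(coalition, score, best_single)
```

In this toy case no single latent covers the whole concept, so a one-to-one match is limited to the best individual latent, whereas the coalition of `z0` and `z1` covers the concept exactly.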


Related papers