Fugu-MT Paper Translation (Summary): Interpretability Transfer from Language to Vision via Sparse Autoencoders

Paper summary: Interpretability Transfer from Language to Vision via Sparse Autoencoders

arxiv url: http://arxiv.org/abs/2605.24946v1
Date: Sun, 24 May 2026 08:47:36 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.532835
Title: Interpretability Transfer from Language to Vision via Sparse Autoencoders
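Title (reference translation): Interpretability Transfer from Language to Vision via Sparse Autoencoders
Authors: Alexey Kravets, Da Li, Chuan Li, Da Chen, Vinay P. Namboodiri
Abstract summary: Recent advances in language model interpretability using sparse autoencoders (SAEs) have not yet translated effectively to the visual domain. Visual Interpretability via SAE Transfer Alignment (VISTA) is a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model.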
Reference score (independently computed attention measure): 24.472985705517
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in language model interpretability using sparse autoencoders (SAEs) have yet to effectively translate to the visual domain, mainly due to the difficulty and ambiguity of labeling visual concepts. In this paper, we introduce Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining a visual projector to map visual tokens into an LLM's pre-existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector using the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate, which measures how accurately the most activating textual concepts in the SAE space correspond to semantic elements in the image. Using this framework, we further analyze spatial localization properties of different vision encoders and show that DINOv2 features have stronger localization abilities than other encoders. Leveraging this precision, we validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, where specific objects are removed or replaced in the model's perception while preserving the surrounding scene. This results in improvements of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.
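Abstract (reference translation): Recent advances in language model interpretability using sparse autoencoders (SAEs) have not carried over effectively to the visual domain, owing to the difficulty and ambiguity of labeling visual concepts. This paper introduces Visual Interpretability via SAE Transfer Alignment (VISTA), a framework that transfers interpretability from language to vision in a LLaVA-style vision-language model by constraining the visual projector to map visual tokens into the LLM's existing, labeled textual SAE space. This approach enables visual interpretability without training dedicated vision SAEs. By regularizing the projector with the LLM's SAE reconstruction loss, VISTA achieves a threefold increase in the matching rate. Using this framework, the authors analyze the spatial localization properties of different vision encoders and show that DINOv2 features localize more strongly than those of other encoders. Leveraging this precision, they validate VISTA's cross-modal alignment through fine-grained, localized concept interventions, in which specific objects are removed or replaced in the model's perception while the surrounding scene is preserved. The result is an improvement of 35% in object removal and 47% in object replacement tasks over vision-only baselines, providing causal evidence that visual tokens inhabit the text SAE manifold. These contributions are validated across multiple LLM architectures.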
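The abstract describes regularizing the visual projector with the LLM's SAE reconstruction loss and then reading off the most strongly activating labeled text concepts. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a ReLU encoder with a linear decoder, a mean-squared-error reconstruction term, and a simple weighted sum with the task loss. The function names, the weight `lam`, and the toy SAE with its labels are all hypothetical and chosen only for illustration.

```python
# Toy sketch (assumptions throughout): a frozen text SAE scores how well
# projected visual tokens can be reconstructed from its feature dictionary.

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_encode(x, sae):
    """Assumed encoder: ReLU(W_enc x + b_enc)."""
    pre = matvec(sae["W_enc"], x)
    return [max(0.0, p + b) for p, b in zip(pre, sae["b_enc"])]

def sae_decode(z, sae):
    """Assumed linear decoder: W_dec z + b_dec."""
    out = matvec(sae["W_dec"], z)
    return [o + b for o, b in zip(out, sae["b_dec"])]

def sae_reconstruction_loss(x, sae):
    """Mean squared error between a token and its SAE reconstruction."""
    x_hat = sae_decode(sae_encode(x, sae), sae)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def projector_loss(task_loss, visual_tokens, sae, lam=0.1):
    """Task loss plus a weighted SAE reconstruction penalty, averaged over
    the projected visual tokens. `lam` is a hypothetical hyperparameter."""
    rec = sum(sae_reconstruction_loss(t, sae) for t in visual_tokens)
    return task_loss + lam * rec / len(visual_tokens)

def top_concepts(x, sae, k=2):
    """Labels of the k most strongly activating SAE features for a token."""
    z = sae_encode(x, sae)
    ranked = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    return [sae["labels"][i] for i in ranked[:k] if z[i] > 0.0]

# Hypothetical 3-feature SAE with identity weights and made-up labels.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
TOY_SAE = {
    "W_enc": IDENTITY, "b_enc": [0.0, 0.0, 0.0],
    "W_dec": IDENTITY, "b_dec": [0.0, 0.0, 0.0],
    "labels": ["dog", "grass", "sky"],
}

ON_MANIFOLD = [1.0, 0.0, 2.0]    # reconstructed exactly by the toy SAE
OFF_MANIFOLD = [-1.0, 0.5, 0.0]  # negative component is lost after ReLU
```

In this toy setup a token that the SAE reconstructs exactly adds no penalty, while a token with components the SAE cannot represent increases the loss. That is the pressure that would push a projector toward the labeled text feature space. The paper's actual loss, weighting, and SAE architecture may differ.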

Related papers