Fugu-MT Paper Translation (Abstract): Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Paper summary: Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

arxiv url: http://arxiv.org/abs/2603.19183v1
Date: Thu, 19 Mar 2026 17:42:05 GMT
Status: translation complete
Last updated in system: 2026-03-21 18:33:57.000057
Title: Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Title (reference translation): Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Authors: Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy, Mac Schwager
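Abstract summary: Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. We show that individual SAE features causally influence robot behavior.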
Reference score (independently computed attention score): 7.1750939299528795
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse toward VLA generalizability. We propose a metric to categorize features according to whether they represent generalizable transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark. We show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn generalizable features across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io
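Abstract (reference translation): Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: these models perform impressively in some settings, but fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn a sparse dictionary whose features serve as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. Some features, however, correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising glimpse toward VLA generalizability. We propose a metric that categorizes features according to whether they represent generalizable, transferable primitives or episode-specific memorization. We validate these findings through steering experiments on the LIBERO benchmark. We show that individual SAE features causally influence robot behavior. Steering general features induces behaviors consistent with their semantic meaning and can be applied across tasks and scenes. This work provides the first mechanistic evidence that VLAs can learn features that generalize across tasks and scenes. We observe that supervised fine-tuning on small robotics datasets disproportionately amplifies memorization. In contrast, training on larger, more diverse datasets (e.g., DROID) or using knowledge insulation promotes more general features. We provide an open-source codebase and a user-friendly interface for activation collection, SAE training, and feature steering. Our project page is located at http://drvla.github.io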
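The abstract describes SAEs as learning a sparse dictionary over hidden-layer activations, and steering as intervening on individual features. The sketch below illustrates that general idea only. It is not the authors' released code: the ReLU encoder form, the choice to steer by adding a scaled decoder direction to the activation, all function names, and the toy weights are assumptions made for illustration.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per SAE feature


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector, b_dec: Vector) -> Vector:
    """Map an activation to non-negative feature coefficients (assumed ReLU encoder)."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = matvec(w_enc, centered)
    return [max(0.0, p + b) for p, b in zip(pre, b_enc)]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruct the activation as a weighted sum of dictionary directions."""
    out = list(b_dec)
    for coeff, direction in zip(f, w_dec):
        for i, d in enumerate(direction):
            out[i] += coeff * d
    return out


def steer(x: Vector, w_dec: Matrix, feature_idx: int, strength: float) -> Vector:
    """Push an activation along one feature's decoder direction (assumed steering rule)."""
    direction = w_dec[feature_idx]
    return [xi + strength * di for xi, di in zip(x, direction)]


# Hypothetical toy dictionary: 3 features over a 2-dimensional activation.
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_enc = w_dec  # tied weights, purely for the toy example
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, 0.0]
features = sae_encode(x, w_enc, b_enc, b_dec)      # only feature 0 is active
reconstruction = sae_decode(features, w_dec, b_dec)

steered = steer(x, w_dec, feature_idx=1, strength=3.0)
steered_features = sae_encode(steered, w_enc, b_enc, b_dec)  # feature 1 now active
```

In this toy setup the activation is encoded with a single active feature, and steering along feature 1 makes that feature fire on re-encoding. In a real VLA the steered activation would be written back into the forward pass so that it can affect the predicted action; that hook is model-specific and is not shown here.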

Related papers