Fugu-MT Paper Translation (Abstract): Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Paper summary: Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

arxiv url: http://arxiv.org/abs/2603.19209v1
Date: Thu, 19 Mar 2026 17:56:32 GMT
Status: Translation complete
System last updated: 2026-03-20 17:19:06.318339
Title: Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Title (reference translation): Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
Authors: Shang-Jui Ray Kuo, Paola Cascante-Bonilla
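Abstract summary: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative.
Reference score (independently computed attention score): 5.475609165327278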
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
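Abstract (reference translation): Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the best overall performance on both VQA and grounding/localization. We further adapt SSM and ViT-family backbones with detection and segmentation training and confirm that dense-task tuning improves performance across families. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve the robustness of both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.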

Related papers