Fugu-MT Paper Translation (Abstract): Can Graphs Help Vision SSMs See Better?

Paper overview: Can Graphs Help Vision SSMs See Better?

arxiv url: http://arxiv.org/abs/2605.11300v1
Date: Mon, 11 May 2026 22:40:37 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.460136
Title: Can Graphs Help Vision SSMs See Better?
Title (reference translation): Do graphs help make Vision SSMs better?
Authors: Dhruv Parikh, Anvitha Ramachandran, Haoyang Fan, Mustafa Munir, Rajgopal Kannan, Viktor Prasanna
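Abstract summary: We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token. The analysis shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning.
Reference score (independently computed attention score): 8.221734233588085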
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends critically on the representation of two-dimensional visual features as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. While effective, these mechanisms primarily adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation, with modest computational overhead. Our analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.
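Abstract (reference translation): Vision state space models inherit the efficiency and long-range modeling ability of Mamba-style selective scans. However, their performance depends heavily on how two-dimensional visual features are represented as one-dimensional token sequences. Existing scan operators range from predefined geometric traversals to dynamic coordinate-based samplers that reroute tokens through predicted offsets and interpolation. Although effective, these mechanisms mainly adapt paths or sampling locations, rather than explicitly modeling which local patches should exchange information before global state-space mixing. This motivates a simple question: can graphs help vision state space models see better? We introduce GraphScan, a graph-induced dynamic scanning operator for Vision SSMs. For each token, GraphScan constructs a spatially bounded local graph, learns feature-conditioned affinities with relative positional bias, and produces the output token by one-step message passing over its semantic neighborhood. The resulting tokens are locally grounded before being processed by the selective SSM for global aggregation. GraphScan preserves the token count and linear scaling in image size, while replacing coordinate-conditioned interpolation with feature-conditioned semantic routing. Integrated into a hierarchical backbone, GraphScan-Mamba achieves state-of-the-art performance among Vision SSMs across image classification, object detection, instance segmentation, and semantic segmentation. The analysis further shows that GraphScan induces interpretable displacement fields over the token lattice, providing a semantic and spatially grounded view of dynamic scanning. These results suggest that future Vision SSMs should treat scanning not merely as geometric serialization, but as learned local semantic routing before global state-space modeling.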
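
Illustrative sketch (not the authors' implementation): the abstract describes three steps per token: build a spatially bounded local graph, score neighbors with a feature-conditioned affinity plus a relative positional bias, then do one step of message passing. The pure-Python sketch below walks through those steps under several assumptions that the abstract does not confirm: a square window of a given `radius` as the local graph, a scaled dot product of raw features as the affinity, a softmax normalization, and a plain dictionary `pos_bias` standing in for a learned bias table. The function name `graphscan_sketch` and all parameter names are hypothetical.

```python
import math


def graphscan_sketch(tokens, radius, pos_bias):
    """One step of message passing over a bounded local window (illustrative only).

    tokens:   H x W grid; each entry is a feature vector (list of floats).
    radius:   neighbors are taken from a (2*radius+1)^2 window (assumed shape).
    pos_bias: dict mapping a relative offset (dy, dx) to a scalar bias; a
              hypothetical stand-in for a learned relative positional bias.
    Returns an H x W grid of output vectors (token count is unchanged).
    """
    H, W = len(tokens), len(tokens[0])
    C = len(tokens[0][0])
    out = [[None] * W for _ in range(H)]
    for y in range(H):
        for x in range(W):
            center = tokens[y][x]
            neighbors, scores = [], []
            # Local graph: every in-bounds token inside the window, self included.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        feat = tokens[ny][nx]
                        # Assumed affinity: scaled dot product plus positional bias.
                        sim = sum(a * b for a, b in zip(center, feat)) / math.sqrt(C)
                        scores.append(sim + pos_bias.get((dy, dx), 0.0))
                        neighbors.append(feat)
            # Numerically stable softmax over the neighborhood.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Message passing: affinity-weighted sum of neighbor features.
            out[y][x] = [
                sum(w * f[c] for w, f in zip(weights, neighbors)) for c in range(C)
            ]
    return out
```

In this sketch the output grid has the same H x W shape as the input, and for a fixed `radius` the work per token is constant, so the total cost grows linearly with the number of tokens, which matches the two properties the abstract claims. In the setting the abstract describes, the resulting tokens would then be passed to the selective SSM for global aggregation; that stage is not sketched here.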

Related papers