Fugu-MT Paper Translation (Abstract): Rethink MAE with Linear Time-Invariant Dynamics

Paper summary: Rethink MAE with Linear Time-Invariant Dynamics

arxiv url: http://arxiv.org/abs/2605.00915v1
Date: Wed, 29 Apr 2026 15:07:00 GMT
Status: Translation complete
Last updated in system: 2026-05-05 20:33:49.479671
Title: Rethink MAE with Linear Time-Invariant Dynamics
Title (reference translation): Rethinking MAE with Linear Time-Invariant Dynamics
Authors: Zice Wang,
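Abstract summary: We show that token order is a critical, exploitable dimension in frozen visual representations. We propose SSMProbe, a probing framework driven by a State Space Model (SSM).
Reference score (independently computed attention score): 0.0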
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Standard representation probing for visual models relies on mathematically permutation-invariant operations like Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by demonstrating that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and ViT as CLS-ablation extreme). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes where sequence order strictly dictates the final state due to inherent memory decay. Formulating token ordering as an information scheduling problem, we compare fixed scan heuristics against a differentiable soft permutation (Sinkhorn-based) learned from downstream supervision. Evaluations on standard and fine-grained classification benchmarks reveal a striking order gap: while fixed scans fail dramatically on highly localized patch features, our learned soft permutation successfully extracts highly competitive performance from otherwise heavily localized patch sequences. We find that pre-training objectives fundamentally shape token structure: DINOv2 concentrates global semantics in optimized CLS tokens leaving patches hyperspecialized, pure MAE preserves distributed representations with heterogeneous patch informativeness, and ViT represents a supervised CLS-dominated extreme. BEiT occupies middle ground. This heterogeneity is order-dependent -- meaning the SSM probe's performance depends critically on which tokens are placed at which temporal positions -- and is not merely a topological property of the spatial grid. SSMProbe's learned routing effectively discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.
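The contrast the abstract draws between permutation-invariant pooling and an order-sensitive LTI probe can be illustrated with a minimal NumPy sketch. It assumes a scalar-decay recurrence h_t = decay * h_{t-1} + gain * x_t; the function names, the `decay` and `gain` parameters, and the random tokens are illustrative assumptions, not the paper's SSMProbe implementation.

```python
import numpy as np

def lti_final_state(tokens, decay=0.9, gain=1.0):
    """Run h_t = decay * h_{t-1} + gain * x_t over the sequence; return h_T.

    Hypothetical scalar-decay LTI system, used only to illustrate
    order sensitivity. The paper's actual SSM may differ.
    """
    state = np.zeros(tokens.shape[1])
    for x in tokens:
        state = decay * state + gain * x
    return state

def gap(tokens):
    """Global Average Pooling: the mean over the token axis."""
    return tokens.mean(axis=0)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 stand-in patch tokens, 8 dims each
reversed_tokens = tokens[::-1]      # same tokens, opposite scan order

# GAP ignores the order of the tokens; the LTI final state does not.
gap_same = np.allclose(gap(tokens), gap(reversed_tokens))
ssm_same = np.allclose(lti_final_state(tokens), lti_final_state(reversed_tokens))

# Closed form: h_T = sum_t decay^(T-1-t) * gain * x_t, so tokens scanned
# early are geometrically down-weighted (the "memory decay").
weights = 0.9 ** np.arange(len(tokens) - 1, -1, -1)
closed_form = weights @ tokens
```

With `decay` below 1, a token's contribution to the final state shrinks geometrically with its distance from the end of the scan, which is why placing informative tokens late in the sequence matters for a probe of this kind.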
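Abstract (reference translation): Standard representation probing for visual models relies on mathematically permutation-invariant operations such as Global Average Pooling (GAP) or CLS tokens, treating patch representations as an unstructured bag-of-words. We challenge this paradigm by showing that token order is a critical, exploitable dimension in frozen visual representations (e.g., MAE, BEiT, DINOv2, and ViT). We propose SSMProbe, a probing framework driven by a State Space Model (SSM). Operating as discrete Linear Time-Invariant (LTI) dynamical systems, SSMs act as permutation-sensitive probes. Formulating token ordering as an information scheduling problem, we compare fixed scan heuristics against a soft permutation (Sinkhorn-based) learned from downstream supervision. While fixed scans fail dramatically on highly localized patch features, our learned soft permutation succeeds in extracting highly competitive performance from heavily localized patch sequences. DINOv2 concentrates global semantics in optimized CLS tokens, leaving patches hyperspecialized; pure MAE preserves distributed representations with heterogeneous patch informativeness; and ViT represents a supervised, CLS-dominated extreme. BEiT occupies the middle ground. This heterogeneity is order-dependent: the SSM probe's performance depends critically on which tokens are placed at which temporal positions, and it is not merely a topological property of the spatial grid. SSMProbe's learned routing effectively discovers and exploits this heterogeneity, offering a powerful new diagnostic lens for visual representation analysis.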
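The abstract describes the learned ordering only as a "Sinkhorn-based" differentiable soft permutation. As a rough sketch of what such an operator generally looks like, the snippet below applies generic log-space Sinkhorn normalization to a score matrix and uses the result to mix tokens. The `scores` matrix (random here, learned from supervision in practice), the `temperature`, and the iteration count are all assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis, keeping dims."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, temperature=1.0, n_iters=200):
    """Alternate row and column normalization in log space.

    Returns an approximately doubly stochastic matrix, i.e. a soft
    relaxation of a permutation matrix. Generic Sinkhorn normalization,
    not the authors' exact operator.
    """
    log_p = scores / temperature
    for _ in range(n_iters):
        log_p = log_p - logsumexp(log_p, axis=1)  # each row sums to 1
        log_p = log_p - logsumexp(log_p, axis=0)  # each column sums to 1
    return np.exp(log_p)

rng = np.random.default_rng(0)
n_tokens, dim = 6, 4
tokens = rng.normal(size=(n_tokens, dim))        # stand-in patch tokens
scores = rng.normal(size=(n_tokens, n_tokens))   # hypothetical learnable logits

soft_perm = sinkhorn(scores)
# Row i of `reordered` is a convex mix of tokens: the token "scheduled"
# at temporal position i before the sequence is fed to the probe.
reordered = soft_perm @ tokens
```

Lowering `temperature` pushes the matrix closer to a hard permutation, while higher values blend tokens more evenly. Every step is differentiable, so in a training setup the scores could receive gradients from the downstream loss.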

Related papers